Introduction¶

The objective of this project is to apply data analysis techniques by adopting a data-driven approach for the launch of a hypothetical new mobile application.

For this analysis, two data sources are available. The first is a comprehensive dataset of applications from the Play Store that provides detailed information about each app: from ratings to reviews, from installation counts to prices, including technical specifications. The second source is a collection of user reviews, already processed using sentiment analysis techniques.

Starting from the foundations, I have established a robust framework for data management, with particular attention to cleaning and preparation.

A particularly interesting aspect of the analysis is the examination of competition across different categories. Some categories might appear attractive at first glance, perhaps due to high download numbers, but could prove to be saturated markets with intense competition. Other categories, seemingly smaller, might conceal valuable niches with less competition and, potentially, users more willing to pay for quality products.

The code has been structured following a modular approach, with particular emphasis on clarity and documentation. Each phase of the analysis is organized into well-defined logical components, allowing readers to easily follow the analytical process from raw data to final conclusions. This structure not only makes the code easier to maintain but also facilitates the verification and validation of the results obtained at each stage of the analysis.

1. Imports and setup¶

The following code cell prepares the working environment for data analysis. It handles library imports and configures the analysis environment.

The libraries used are:

  • warnings: to manage and suppress warning messages that might clutter the analysis output;
  • logging: to monitor the various phases of analysis and capture errors or unexpected behaviors during data processing;
  • typing: enables specification of expected data types for variables and functions, enhancing code robustness and self-documentation. I used it to clearly define function inputs and outputs, such as Dict for dictionaries or Optional for values that could be None;
  • dataclasses: simplifies the creation of data-containing classes;
  • pathlib, os, and sys: used together to manage file operations, such as verifying CSV existence and handling paths across different operating systems, and to terminate execution cleanly when loading fails;
  • lru_cache from functools: implements a memory cache for frequently used formatting functions, preventing recalculation of previously obtained results and improving performance;
  • ThreadPoolExecutor from concurrent.futures: enables parallelization of resource-intensive data processing operations, enhancing analysis performance;
  • re: for data cleaning and standardization, particularly for extracting numerical information from strings such as app prices and sizes;
  • datetime: essential for temporal data analysis, especially for calculating time intervals;
  • pandas: to create and manipulate dataframes;
  • numpy: complements pandas by providing advanced numerical calculation capabilities, such as metrics computation, statistics, and management of multidimensional arrays necessary for app performance analysis;
  • pandas.api.types: provides utilities for checking dataframe column data types (such as is_numeric_dtype), ensuring numerical operations are performed only on appropriate data.
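
As a minimal sketch of the lru_cache behavior mentioned above (the `calls` counter is purely illustrative, added to make the cache hit visible):

```python
from functools import lru_cache

calls = {"count": 0}  # demo-only counter to show the cache working

@lru_cache(maxsize=1000)
def format_number(num: float) -> str:
    """Format a number with thousand separators (result is cached)."""
    calls["count"] += 1
    return f"{num:,.0f}"

print(format_number(1_000_000))  # computed on first call
print(format_number(1_000_000))  # served from the cache, no recomputation
print(calls["count"])            # 1
```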

 

For data visualization, I employed two complementary approaches:

  • plotly (with its express, graph_objects, and subplots modules) to create interactive and detailed visualizations of Play Store metrics, enabling dynamic data exploration;
  • matplotlib.pyplot and seaborn to generate traditional statistical visualizations, particularly valuable for distribution and correlation analysis.

Warnings are suppressed with warnings.filterwarnings('ignore') to maintain clean output, while the logging system is configured through logging.basicConfig() to track operations and errors.

 

The PlotConfig class, defined using the @dataclass decorator, centralizes visualization configurations.

COLOR_PALETTE maps states like 'primary', 'success', 'warning' to their corresponding hexadecimal color codes, while PLOT_STYLE establishes a consistent style for plots with font, sizes, and basic characteristics.

The __post_init__ method automatically executes after the __init__ method and defines the default values for COLOR_PALETTE and PLOT_STYLE.
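
A stripped-down sketch of the same pattern, using a hypothetical MiniConfig class:

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class MiniConfig:
    palette: Optional[Dict[str, str]] = None

    def __post_init__(self):
        # Runs automatically right after the generated __init__
        if self.palette is None:
            self.palette = {'primary': '#2c3e50'}

print(MiniConfig().palette)                             # default is filled in
print(MiniConfig(palette={'primary': '#000'}).palette)  # caller's value wins
```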

I implemented the DataFormatter class to handle data formatting. It contains three static methods, each decorated with @lru_cache(maxsize=1000) for result memoization: format_number(), format_percentage(), and format_currency(), which process numbers with thousand separators, percentages with one decimal place, and monetary values respectively.

The @staticmethod decorator indicates that these are static methods, which can be called directly on the class without instantiation.
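
For illustration, a hypothetical MiniFormatter shows the pattern of calling a static method directly on the class:

```python
class MiniFormatter:
    @staticmethod
    def format_percentage(num: float) -> str:
        """Format a percentage with one decimal place."""
        return f"{num:.1f}%"

# Called directly on the class: no instance is created
print(MiniFormatter.format_percentage(12.34))  # 12.3%
```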

 

The DataLoader class forms the core of data loading. The _is_colab_environment() method detects execution on Google Colab, while _setup_visualization_settings() configures pandas and visualization tool settings. The main method load_data() handles CSV file loading through a try-except block, logging any errors via the logger. Initialization occurs with data_loader = DataLoader(), followed by a data loading attempt. A check with if apps_df is None or reviews_df is None verifies successful operation, terminating execution with sys.exit(1) if an error occurs.

 


In [ ]:
# Warning management and logging
import warnings
import logging
from typing import Dict, List, Tuple, Optional, Any, NamedTuple, Union
from dataclasses import dataclass, field
from pathlib import Path
import sys
import os
from functools import lru_cache
from concurrent.futures import ThreadPoolExecutor
import re
from datetime import datetime

# Disable warnings
warnings.filterwarnings('ignore')

# Base logging setup
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

# Basic libraries for data analysis
import pandas as pd
import numpy as np
from pandas.api.types import is_numeric_dtype

# Libraries for visualization
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import matplotlib.pyplot as plt
import seaborn as sns

@dataclass
class PlotConfig:
    COLOR_PALETTE: Optional[Dict[str, str]] = None
    PLOT_STYLE: Optional[Dict[str, Any]] = None

    def __post_init__(self):
        # Fill in defaults only when the caller did not supply values
        if self.COLOR_PALETTE is None:
            self.COLOR_PALETTE = {
                'primary': '#2c3e50',
                'secondary': '#34495e',
                'success': '#27ae60',
                'warning': '#f39c12',
                'danger': '#c0392b',
                'info': '#3498db'
            }

        if self.PLOT_STYLE is None:
            self.PLOT_STYLE = {
                'template': 'plotly_white',
                'font_family': 'Arial, sans-serif',
                'title_font_size': 20,
                'title_x': 0.5,
                'showlegend': True
            }

class DataFormatter:

    @staticmethod
    @lru_cache(maxsize=1000)
    def format_number(num: Union[int, float]) -> str:
        """Formats numbers with thousand separators"""
        return f"{num:,.0f}"

    @staticmethod
    @lru_cache(maxsize=1000)
    def format_percentage(num: Union[int, float]) -> str:
        """Formats percentages with 1 decimal place"""
        return f"{num:.1f}%"

    @staticmethod
    @lru_cache(maxsize=1000)
    def format_currency(num: Union[int, float]) -> str:
        """Formats monetary values"""
        return f"${num:,.2f}"

class DataLoader:

    def __init__(self):
        self.plot_config = PlotConfig()

    def _is_colab_environment(self) -> bool:
        try:
            import google.colab
            return True
        except ImportError:
            return False

    def _setup_visualization_settings(self):

        # Pandas settings
        pd.set_option('display.max_columns', None)
        pd.set_option('display.max_rows', 100)
        pd.set_option('display.float_format', lambda x: '%.3f' % x)

        # Matplotlib/seaborn settings
        plt.style.use('default')
        sns.set_theme(style='whitegrid')

    def load_data(self) -> Tuple[Optional[pd.DataFrame], Optional[pd.DataFrame]]:
        try:
            # Setup visualization
            self._setup_visualization_settings()

            # Determine environment and load data
            if self._is_colab_environment():
                logger.info("Detected environment: Google Colab")
                from google.colab import files
                logger.info("Please upload the files 'googleplaystore.csv' and 'googleplaystore_user_reviews.csv'")
                uploaded = files.upload()

            # Check file presence
            required_files = ['googleplaystore.csv', 'googleplaystore_user_reviews.csv']
            for file in required_files:
                if not os.path.exists(file):
                    raise FileNotFoundError(f"File {file} not found in the current directory")

            apps_df = pd.read_csv('googleplaystore.csv')
            reviews_df = pd.read_csv('googleplaystore_user_reviews.csv')

            logger.info(f"Datasets loaded successfully!")
            logger.info(f"apps_df dimensions: {apps_df.shape}")
            logger.info(f"reviews_df dimensions: {reviews_df.shape}")

            return apps_df, reviews_df

        except Exception as e:
            logger.error(f"Error loading data: {str(e)}")
            return None, None

# Loader initialization and data loading
data_loader = DataLoader()
apps_df, reviews_df = data_loader.load_data()

# Verify that data has been loaded correctly
if apps_df is None or reviews_df is None:
    logger.error("Error loading datasets. Check the presence of files or reload them.")
    sys.exit(1)

2. Data reading and validation¶

The second block of code addresses the validation and initial analysis of the two main datasets: apps_df, which contains information about Google Play Store apps, and reviews_df, which contains user reviews.

At the beginning of the cell, I import partial from functools, which I will use to parallelize validation operations. Then I define a series of specialized classes to handle different aspects of data validation.

 

The DatasetMetrics class is defined as a @dataclass and serves as a container for the main metrics of a dataset. It includes fields such as:

  • rows and columns for dimensions;
  • missing_data to track the percentages of missing values per column;
  • duplicates and duplicate_percentage for duplicate records;
  • dtypes for column data types;
  • unique_counts to count unique values in each column.

 

The DataValidator class implements the actual validation logic. Its constructor accepts max_workers to control parallelization.

The _calculate_column_metrics method is decorated with @staticmethod because it doesn't need to access the class instance state. It takes a dataframe (df) and column name (column) as input and returns a dictionary with the following metrics:

  • type: the data type of the column obtained with dtype and converted to string;
  • non_null: the number of non-null values calculated with count();
  • null: the number of null values obtained with isnull().sum();
  • null_perc: the percentage of null values calculated as (null/total)*100 and rounded to 2 decimals;
  • unique_values: the number of unique values in the column obtained with nunique().

The validate_dataset method is the core of validation and uses ThreadPoolExecutor to parallelize calculations across columns. Through executor.map and partial, it applies _calculate_column_metrics to all columns simultaneously. It also calculates the number of duplicates in the dataset using duplicated().sum().
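
The combination of executor.map and partial can be sketched on a toy stand-in for a dataframe (count_nulls and the table dict are illustrative only):

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial

table = {'a': [1, 2, None], 'b': [4, None, None]}  # stand-in for dataframe columns

def count_nulls(data, column):
    """Per-column metric; the first argument is fixed with partial."""
    return sum(value is None for value in data[column])

with ThreadPoolExecutor(max_workers=2) as executor:
    # partial(count_nulls, table) fixes `data`, so map only varies `column`
    null_counts = dict(zip(table, executor.map(partial(count_nulls, table), table)))

print(null_counts)  # {'a': 1, 'b': 2}
```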

 

I then implemented the DataConsistencyChecker class to verify consistency between the two datasets. Its check_consistency method (decorated with @staticmethod) uses set operations to compare apps present in the datasets:

  • creates two sets with unique() to get the unique apps in each dataset;
  • uses intersection to find apps present in both;
  • uses the - operator to identify apps with reviews but absent from the main dataset.

During execution, various useful statistics on the distribution of apps between datasets are logged. The method keeps track of the total number of apps in each dataset, how many are in common, and generates a warning if it finds apps that have reviews but don't exist in the main dataset. The overall data integrity is also verified, recording the number of unique apps in the Play Store and the total records in the reviews. At the end of the verification, the method returns the original dataframes in a tuple, without making any changes.
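
On toy app names (purely illustrative), the set operations behave as follows:

```python
apps_in_store = {'Maps', 'Gmail', 'Chrome'}
apps_in_reviews = {'Gmail', 'Chrome', 'OldApp'}

common = apps_in_store.intersection(apps_in_reviews)
missing = apps_in_reviews - apps_in_store  # reviewed but absent from the store data

print(sorted(common))   # ['Chrome', 'Gmail']
print(sorted(missing))  # ['OldApp']
```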

 

The InitialAnalyzer class provides a first overview of the data. Its analyze_dataset method separates columns into numeric and categorical using select_dtypes:

  • for numeric columns it uses describe() to obtain basic statistics;
  • for categorical columns it counts unique values and, if there are fewer than 10, calculates the distribution with value_counts(normalize=True).

The main function validate_and_analyze_data effectively manages the entire process:

  1. initializes the necessary classes if not provided;
  2. performs validation of both datasets;
  3. verifies their consistency;
  4. conducts the initial analysis.

Everything is managed in a try-except block to catch and log any errors.

 

The use of logger throughout the code allows detailed tracking of the process and its results, facilitating the identification of any problems in the data.

In [ ]:
from functools import partial


logger = logging.getLogger(__name__)

@dataclass
class DatasetMetrics:
    rows: int
    columns: int
    missing_data: Dict[str, float]
    duplicates: int
    duplicate_percentage: float
    dtypes: Dict[str, str]
    unique_counts: Dict[str, int]

class DataValidator:

    def __init__(self, max_workers: int = 4):
        self.max_workers = max_workers

    @staticmethod
    def _calculate_column_metrics(df: pd.DataFrame, column: str) -> Dict[str, Any]:
        return {
            'type': str(df[column].dtype),
            'non_null': df[column].count(),
            'null': df[column].isnull().sum(),
            'null_perc': (df[column].isnull().sum() / len(df) * 100).round(2),
            'unique_values': df[column].nunique()
        }

    def validate_dataset(self, df: pd.DataFrame, dataset_name: str) -> DatasetMetrics:
        logger.info(f"\nValidating dataset: {dataset_name}")
        logger.info("-" * 50)

        # Calculate base metrics
        rows, cols = df.shape
        logger.info(f"Dimensions: {rows:,} rows, {cols} columns")

        # Calculate metrics for each column in parallel
        with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
            column_metrics = dict(zip(
                df.columns,
                executor.map(partial(self._calculate_column_metrics, df), df.columns)
            ))

        # Calculate duplicates
        duplicates = df.duplicated().sum()
        duplicate_percentage = (duplicates/len(df)*100).round(2)

        logger.info(f"\nDuplicates found: {duplicates:,} ({duplicate_percentage}%)")

        return DatasetMetrics(
            rows=rows,
            columns=cols,
            missing_data={col: metrics['null_perc']
                         for col, metrics in column_metrics.items()},
            duplicates=duplicates,
            duplicate_percentage=duplicate_percentage,
            dtypes={col: metrics['type']
                   for col, metrics in column_metrics.items()},
            unique_counts={col: metrics['unique_values']
                         for col, metrics in column_metrics.items()}
        )

class DataConsistencyChecker:

    @staticmethod
    def check_consistency(apps_df: pd.DataFrame,
                         reviews_df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame]:
        logger.info("\nVerifying consistency between datasets")
        logger.info("-" * 50)

        # Check apps present in both datasets
        apps_in_store = set(apps_df['App'].unique())
        apps_in_reviews = set(reviews_df['App'].unique())

        common_apps = apps_in_store.intersection(apps_in_reviews)
        missing_apps = apps_in_reviews - apps_in_store

        logger.info(f"Apps in Play Store: {len(apps_in_store):,}")
        logger.info(f"Apps with reviews: {len(apps_in_reviews):,}")
        logger.info(f"Apps in common: {len(common_apps):,}")

        if missing_apps:
            logger.warning(
                f"\nWarning: {len(missing_apps):,} apps have reviews "
                "but are not in the main dataset"
            )

        # Verify integrity
        logger.info("\nVerifying data integrity:")
        logger.info(f"- Unique apps in Play Store: {apps_df['App'].nunique():,}")
        logger.info(f"- Total review records: {len(reviews_df):,}")

        return apps_df, reviews_df

class InitialAnalyzer:

    @staticmethod
    def analyze_dataset(df: pd.DataFrame, dataset_name: str) -> None:
        logger.info(f"\nInitial analysis: {dataset_name}")
        logger.info("-" * 50)

        # Analyze numeric columns
        numeric_cols = df.select_dtypes(include=[np.number]).columns
        if len(numeric_cols) > 0:
            logger.info("\nNumeric column statistics:")
            logger.info(df[numeric_cols].describe().round(2))

        # Analyze categorical columns
        categorical_cols = df.select_dtypes(include=['object']).columns
        if len(categorical_cols) > 0:
            logger.info("\nCategorical column statistics:")
            for col in categorical_cols:
                unique_vals = df[col].nunique()
                logger.info(f"\n{col}:")
                logger.info(f"- Unique values: {unique_vals:,}")
                if unique_vals < 10:
                    dist = df[col].value_counts(normalize=True).head()
                    logger.info(f"Distribution:\n{dist.round(3)}")

def validate_and_analyze_data(apps_df: pd.DataFrame,
                            reviews_df: pd.DataFrame,
                            validator: Optional[DataValidator] = None,
                            consistency_checker: Optional[DataConsistencyChecker] = None,
                            initial_analyzer: Optional[InitialAnalyzer] = None) -> Tuple[pd.DataFrame, pd.DataFrame]:
    logger.info("=== BEGINNING DATA VALIDATION AND ANALYSIS ===")

    # Initialize components if not provided
    validator = validator or DataValidator()
    consistency_checker = consistency_checker or DataConsistencyChecker()
    initial_analyzer = initial_analyzer or InitialAnalyzer()

    try:
        # Dataset validation
        apps_metrics = validator.validate_dataset(apps_df, "Google Play Store Apps")
        reviews_metrics = validator.validate_dataset(reviews_df, "App Reviews")

        # Consistency check
        apps_df, reviews_df = consistency_checker.check_consistency(apps_df, reviews_df)

        # Initial analysis
        initial_analyzer.analyze_dataset(apps_df, "Google Play Store Apps")
        initial_analyzer.analyze_dataset(reviews_df, "App Reviews")

        return apps_df, reviews_df

    except Exception as e:
        logger.error(f"Error during validation and analysis: {str(e)}")
        raise

# Execute validation and analysis
apps_df, reviews_df = validate_and_analyze_data(apps_df, reviews_df)
WARNING:__main__:
Warning: 54 apps have reviews but are not in the main dataset

3. Data cleaning and preparation¶

The third code cell focuses on data cleaning and preparation, a crucial phase for ensuring that subsequent analysis rests on accurate and well-structured information. The DataCleaner base class collects four static methods, each responsible for one cleaning task:

  • clean_size() converts app sizes to megabytes. It handles cases like 'Varies with device' and automatically converts KB to MB by dividing by 1024 when necessary. It uses a regular expression, re.sub(r'[^0-9.]', '', size_str), to strip everything but digits and the decimal point from the string;
  • clean_price() standardizes price values in numerical format, converting strings like '$4.99' to decimal values (4.99) and handling special cases like 'Free' which are transformed to 0.0;
  • clean_installs() converts installation numbers to integers, removing characters like commas and the '+' sign (e.g., '1,000,000+' becomes 1000000);
  • clean_android_version() extracts and normalizes the Android version, using a regular expression to find the main numerical format (e.g., from "Android 4.0.3 or up" it extracts 4.0).
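
The size conversion described in the first bullet can be sketched in condensed form (clean_size here is a simplified standalone version, not the exact method from the cell):

```python
import math
import re

def clean_size(size):
    """Simplified sketch: convert a Play Store size string to MB."""
    if size is None or size == 'Varies with device':
        return math.nan
    size_str = str(size).strip().upper()
    multiplier = 1 / 1024 if 'K' in size_str else 1  # KB -> MB
    return float(re.sub(r'[^0-9.]', '', size_str)) * multiplier

print(clean_size('19M'))   # 19.0
print(clean_size('512k'))  # 0.5
print(clean_size('Varies with device'))  # nan
```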

 

Two specialized classes derive from the DataCleaner base class. The first is AppsDataCleaner, dedicated to cleaning the applications dataset. This class implements _parallel_clean_row(), which cleans a single row by applying the cleaning methods and adding new clean columns, and clean_apps_dataset(), which coordinates the entire process using ThreadPoolExecutor to parallelize cleaning and improve performance.

During the app cleaning process, several advanced operations are performed:

  • data type conversion using pd.to_numeric() and pd.to_datetime();
  • removal of rows with missing values through dropna();
  • feature engineering with the creation of the Days_Since_Update variable, which measures the number of days elapsed since each app's last update, adding temporal information useful for the analysis;
  • categorization of continuous variables like price and installations using pd.cut();
  • calculation of market metrics such as Market_Share and Category_Share.
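
The same price bins used in the cell below can be exercised on a toy series (note that pd.cut intervals are right-inclusive by default):

```python
import numpy as np
import pandas as pd

prices = pd.Series([0.0, 0.99, 1.99, 4.99, 9.99])

price_category = pd.cut(
    prices,
    bins=[-np.inf, 0, 0.99, 2.99, 4.99, np.inf],
    labels=['Free', 'Very Low', 'Low', 'Medium', 'Premium']
)
print(list(price_category.astype(str)))
# ['Free', 'Very Low', 'Low', 'Medium', 'Premium']
```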

 

The second derived class, ReviewsDataCleaner, is specific to the reviews dataset. Although simpler, it handles several tasks:

  • cleaning missing values in sentiment columns;
  • converting data types for polarity and subjectivity metrics;
  • creating the Review_Length feature to analyze review length;
  • categorizing sentiment polarity into "Negative", "Neutral", and "Positive".
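
A toy polarity series shows the sentiment mapping (with these bins a polarity of exactly -1 would fall outside the first interval unless include_lowest=True is passed):

```python
import pandas as pd

polarity = pd.Series([-0.8, 0.0, 0.7])
sentiment = pd.cut(
    polarity,
    bins=[-1, -0.33, 0.33, 1],
    labels=['Negative', 'Neutral', 'Positive']
)
print(list(sentiment.astype(str)))  # ['Negative', 'Neutral', 'Positive']
```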

I should note that these functionalities are not fully utilized in subsequent analyses. Initially, I had planned to develop visualizations and insights based on sentiment and user reviews, but during analysis I found that these data did not produce sufficiently significant or interpretable results for the project objectives. I therefore decided to focus on analyzing app metrics (rating, installations, price) which provided more concrete insights for identifying market opportunities. Despite reviews not being used in subsequent analyses, I have maintained this cleaning code section for methodological completeness and possible future investigations.

Finally, the main function clean_datasets() manages the entire cleaning process. It initializes the necessary cleaners, performs the cleaning of both datasets, and records detailed statistics on the transformations performed. Everything is encapsulated in a try-except block to handle any errors during the process.

 

The final result is two clean and enriched DataFrames, apps_clean and reviews_clean, ready for subsequent exploratory analyses.

In [ ]:
logger = logging.getLogger(__name__)

@dataclass
class CleaningReport:
    original_rows: int
    cleaned_rows: int
    removed_rows: int
    missing_before: Dict[str, int]
    missing_after: Dict[str, int]
    cleaning_steps: List[str]

class DataCleaner:

    @staticmethod
    def clean_size(size: str) -> Optional[float]:
        """Converts the app size to MB"""
        if pd.isna(size) or size == 'Varies with device':
            return np.nan
        try:
            size_str = str(size).strip().upper()
            multiplier = 1/1024 if 'K' in size_str else 1
            return float(re.sub(r'[^0-9.]', '', size_str)) * multiplier
        except (ValueError, AttributeError):
            return np.nan

    @staticmethod
    def clean_price(price: str) -> float:
        """Converts the price to numeric value"""
        if pd.isna(price) or price in ['Free', '0', 'Everyone']:
            return 0.0
        try:
            return float(re.sub(r'[^0-9.]', '', str(price)))
        except (ValueError, AttributeError):
            return 0.0

    @staticmethod
    def clean_installs(installs: str) -> int:
        """Converts the number of installations to numeric value"""
        if pd.isna(installs):
            return 0
        try:
            return int(str(installs).replace(',', '').replace('+', '').strip())
        except ValueError:
            return 0

    @staticmethod
    def clean_android_version(version: str) -> Optional[float]:
        """Extracts and normalizes the Android version"""
        if pd.isna(version):
            return np.nan
        try:
            match = re.search(r'(\d+\.?\d?)', str(version))
            return round(float(match.group(1)), 1) if match else np.nan
        except (ValueError, AttributeError):
            return np.nan

class AppsDataCleaner(DataCleaner):

    def __init__(self, max_workers: int = 4):
        self.max_workers = max_workers
        self.cleaning_steps = []

    def _parallel_clean_row(self, row: pd.Series) -> pd.Series:
        """Cleans a single row of the dataset in parallel"""
        row = row.copy()
        row['Size_MB'] = self.clean_size(row['Size'])
        row['Price_Clean'] = self.clean_price(row['Price'])
        row['Installs_Clean'] = self.clean_installs(row['Installs'])
        row['Android_Ver_Clean'] = self.clean_android_version(row['Android Ver'])
        return row

    def clean_apps_dataset(self, df: pd.DataFrame) -> Tuple[pd.DataFrame, CleaningReport]:
        """Cleans the applications dataset"""
        logger.info("Cleaning applications dataset in progress...")
        df_clean = df.copy()
        original_rows = len(df_clean)
        missing_before = df_clean.isnull().sum().to_dict()

        # Parallel row cleaning
        with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
            cleaned_rows = list(executor.map(self._parallel_clean_row, [row for _, row in df_clean.iterrows()]))
        df_clean = pd.DataFrame(cleaned_rows, index=df_clean.index)

        # Data type conversion
        df_clean['Rating'] = pd.to_numeric(df_clean['Rating'], errors='coerce')
        df_clean['Reviews'] = pd.to_numeric(df_clean['Reviews'], errors='coerce')
        df_clean['Last Updated'] = pd.to_datetime(df_clean['Last Updated'], errors='coerce')

        # Remove rows with critical missing values
        df_clean = df_clean.dropna(subset=['Android_Ver_Clean'])

        # Feature engineering
        df_clean['Days_Since_Update'] = (pd.Timestamp.now() - df_clean['Last Updated']).dt.days

        # Categorization
        df_clean['Price_Category'] = pd.cut(
            df_clean['Price_Clean'],
            bins=[-np.inf, 0, 0.99, 2.99, 4.99, np.inf],
            labels=['Free', 'Very Low', 'Low', 'Medium', 'Premium']
        )

        df_clean['Install_Category'] = pd.cut(
            df_clean['Installs_Clean'],
            bins=[0, 1000, 100000, 1000000, 10000000, np.inf],
            labels=['Very Low', 'Low', 'Medium', 'High', 'Very High']
        )

        # Calculate market metrics
        total_installs = df_clean['Installs_Clean'].sum()
        df_clean['Market_Share'] = df_clean['Installs_Clean'] / total_installs

        df_clean['Category_Share'] = df_clean.groupby('Category')['Installs_Clean'].transform(
            lambda x: x / x.sum()
        )

        # Cleaning report
        cleaning_report = CleaningReport(
            original_rows=original_rows,
            cleaned_rows=len(df_clean),
            removed_rows=original_rows - len(df_clean),
            missing_before=missing_before,
            missing_after=df_clean.isnull().sum().to_dict(),
            cleaning_steps=self.cleaning_steps
        )

        return df_clean, cleaning_report

class ReviewsDataCleaner(DataCleaner):

    def clean_reviews_dataset(self, df: pd.DataFrame) -> Tuple[pd.DataFrame, CleaningReport]:
        """Cleans the reviews dataset"""
        logger.info("Cleaning reviews dataset in progress...")
        df_clean = df.copy()
        original_rows = len(df_clean)
        missing_before = df_clean.isnull().sum().to_dict()

        # Clean missing values
        df_clean = df_clean.dropna(subset=['Sentiment', 'Sentiment_Polarity'])

        # Data type conversion
        df_clean['Sentiment_Polarity'] = pd.to_numeric(df_clean['Sentiment_Polarity'], errors='coerce')
        df_clean['Sentiment_Subjectivity'] = pd.to_numeric(df_clean['Sentiment_Subjectivity'], errors='coerce')

        # Feature engineering
        df_clean['Review_Length'] = df_clean['Translated_Review'].str.len()

        # Sentiment categorization
        df_clean['Sentiment_Category'] = pd.cut(
            df_clean['Sentiment_Polarity'],
            bins=[-1, -0.33, 0.33, 1],
            labels=['Negative', 'Neutral', 'Positive']
        )

        # Cleaning report
        cleaning_report = CleaningReport(
            original_rows=original_rows,
            cleaned_rows=len(df_clean),
            removed_rows=original_rows - len(df_clean),
            missing_before=missing_before,
            missing_after=df_clean.isnull().sum().to_dict(),
            cleaning_steps=[]
        )

        return df_clean, cleaning_report

def clean_datasets(apps_df: pd.DataFrame,
                  reviews_df: pd.DataFrame,
                  apps_cleaner: Optional[AppsDataCleaner] = None,
                  reviews_cleaner: Optional[ReviewsDataCleaner] = None) -> Tuple[pd.DataFrame, pd.DataFrame]:
    logger.info("=== BEGINNING CLEANING PROCESS ===")

    # Initialize cleaners if not provided
    apps_cleaner = apps_cleaner or AppsDataCleaner()
    reviews_cleaner = reviews_cleaner or ReviewsDataCleaner()

    try:
        # Clean app dataset
        apps_clean, apps_report = apps_cleaner.clean_apps_dataset(apps_df)
        logger.info(f"\nApps dataset cleaning completed:")
        logger.info(f"Original rows: {apps_report.original_rows:,}")
        logger.info(f"Rows after cleaning: {apps_report.cleaned_rows:,}")
        logger.info(f"Removed rows: {apps_report.removed_rows:,}")

        # Clean reviews dataset
        reviews_clean, reviews_report = reviews_cleaner.clean_reviews_dataset(reviews_df)
        logger.info(f"\nReviews dataset cleaning completed:")
        logger.info(f"Original rows: {reviews_report.original_rows:,}")
        logger.info(f"Rows after cleaning: {reviews_report.cleaned_rows:,}")
        logger.info(f"Removed rows: {reviews_report.removed_rows:,}")

        return apps_clean, reviews_clean

    except Exception as e:
        logger.error(f"Error during data cleaning: {str(e)}")
        raise

# Execute cleaning
apps_clean, reviews_clean = clean_datasets(apps_df, reviews_df)

4. Exploratory data analysis¶

This code cell focuses on the exploratory analysis of cleaned app data to obtain useful market information, and defines classes and functions to perform statistical analysis and generate visualizations. The first part of the code defines two classes using the @dataclass decorator:

  1. CategoryStats: a class used to store statistical information about app categories. It includes fields such as the number of apps in a category, the average rating, the percentage of paid apps, the average number of installations, and the average size of apps;
  2. MarketAnalysis: a class used to contain the results of market analysis. It includes a pandas dataframe with statistics by category, a dictionary with price-related statistics, a pandas dataframe with market competitiveness analysis, and a list to store plotly figures generated by the analysis.

 

The main class of this section is MarketAnalyzer, which takes the cleaned apps_df dataframe as input during initialization. The class includes methods for calculating statistics, creating visualizations, and performing different types of market analysis. The __init__ method initializes MarketAnalyzer by removing the rows where Category is '1.9' (an anomalous record) and sets the number of threads that will be used for parallel calculations via the max_workers argument.

The _calculate_category_stats method is a helper method, decorated with @lru_cache(maxsize=None), which calculates and returns CategoryStats for a given category. The @lru_cache decorator stores the results of this method for efficiency, avoiding redundant calculations when it is called multiple times with the same category.

The analyze_category_distribution method analyzes the distribution of apps across different categories. It uses a ThreadPoolExecutor to parallelize the calculation of statistics by category, creates a pandas dataframe with the calculated statistics, and generates a bar chart using plotly.express to visualize the distribution of apps across categories, using the average rating as a color shade.

The analyze_price_distribution method analyzes the distribution of prices for paid apps. It filters apps_df to include only paid apps, defines price ranges and related labels to group prices into categories, and calculates price statistics such as the total number of apps, the number and percentage of paid apps, and the average, median, and maximum price of paid apps. It then aggregates the data by price range and generates a bar chart with plotly.graph_objects showing, for each price range on the x-axis, the percentage of apps on the y-axis. The bars are colored based on the average rating within each price range.

The analyze_market_competition method aims to visually represent the competitiveness of the app market through a map that relates various key metrics. To build this map, the code begins by grouping the apps_df dataframe by category ("Category") and calculating four fundamental metrics for each:

  • the number of apps, which directly indicates the level of competition;
  • the average rating, which reflects the quality perceived by users;
  • the percentage of paid apps, which represents the propensity to pay in that category;
  • the average number of installations, which offers a measure of market breadth.

The payment propensity (paid_perc) is calculated with a lambda expression 'Price_Clean': lambda x: (x > 0).mean() * 100. This formula transforms app prices into boolean values (True for paid apps, False for free apps), calculates their mean (obtaining the proportion of paid apps), and multiplies by 100 to express the result as a percentage.

Specifically:

  1. (x > 0) creates an array of boolean values where True represents apps with a price greater than zero (paid apps) and False those that are free;
  2. .mean() calculates the mean of these boolean values, which is equivalent to the proportion of paid apps (a value between 0 and 1);
  3. * 100 converts this proportion into a percentage.
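This boolean-mean mechanism can be checked on a toy pandas Series (the values below are illustrative, not drawn from the dataset):

```python
import pandas as pd

# Five hypothetical app prices: two paid, three free
prices = pd.Series([0.0, 0.99, 0.0, 2.99, 0.0])

# (prices > 0) -> [False, True, False, True, False]; its mean is the
# proportion of paid apps (2/5 = 0.4), multiplied by 100 for a percentage
paid_perc = (prices > 0).mean() * 100
print(paid_perc)  # -> 40.0
```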

This metric is important because it provides an indication of users' willingness to pay for apps in that category. A higher percentage suggests that:

  • users in that category are more willing to pay for content and features;
  • there is a precedent for direct monetization that new entrants could exploit;
  • the "premium" business model (direct payment) might be more easily accepted compared to freemium or advertising-based models.

Subsequently, a "market size index" (market_size_index) is calculated based on the number of apps in each category, normalized between 0 and 1. This index represents the relative size of each category compared to others, where a category with more apps will have a higher index, indicating a larger and potentially more competitive market.

The actual visualization is created with a scatter plot using plotly.graph_objects. Each point on the graph represents a category of apps, positioned according to two main dimensions:

  • the x-axis represents the number of apps in the category, or the level of competition. The further right a point is, the greater the number of apps in that category and therefore the higher the competition;
  • the y-axis represents the average rating of apps in the category. The higher a point is, the greater the perceived quality of apps in that category.

The size of each point is proportional to the market size index calculated previously. Therefore, larger points indicate categories with a potentially larger market. The color of each point represents the average rating of apps in that category, using a color scale from red (low rating) to green (high rating), providing a visual redundancy that reinforces the information on the y-axis and allows quick identification of categories with apps perceived as superior quality.

Finally, the method calculates an opportunity_score for each category, based on a weighted combination of three factors:

  • 40% from the average rating, rewarding categories with higher quality apps;
  • 30% from the inverse of the market size index, favoring less competitive categories;
  • 30% from the percentage of paid apps, valuing categories where there is a culture of purchasing.

The idea behind this score is that categories with high ratings, low competition, and a high percentage of paid apps potentially represent the best market opportunities, balancing quality, ease of entry, and potential for direct monetization. The visualization has been further enriched with an annotation showing the top 5 categories based on this opportunity score.

 

The perform_exploratory_analysis function manages the entire process of exploratory data analysis. It initializes MarketAnalyzer if it is not provided in the function arguments, calls the analysis methods of MarketAnalyzer to obtain results and related visualizations, calculates the top 5 market opportunities based on the opportunity score, and returns a MarketAnalysis object containing the results and all generated charts.

Finally, the code performs the exploratory analysis by calling perform_exploratory_analysis - using the cleaned dataframes apps_clean and reviews_clean - and displays the generated figures by iterating through the figures in the market_analysis object and using the show() method to display each plotly chart in the output.

 

Interpretation of results¶

 

Distribution of apps by category and average rating¶

The first chart provides an overview of the distribution of apps in the Google Play Store. What immediately stands out is the predominance of the "FAMILY" category, which hosts almost 2000 applications, representing the largest market segment. In second place, we find "GAME" with more than 1000 apps, which reflects the significant popularity of mobile gaming and its ability to generate substantial revenue. Further behind, we find "TOOLS" with about 750 apps, a category that encompasses various utilities and tools.

The visualization reveals an extremely heterogeneous app ecosystem, where a few categories gather most of the applications, while many others represent niches with a much more limited numerical presence. Categories like "EVENTS", "BEAUTY", "PARENTING", and "WEATHER" have a very limited presence, suggesting potential opportunities in less saturated markets.

Particularly interesting is the relationship between the number of apps and the average rating, highlighted by the coloring system. Categories with fewer applications tend to have higher average ratings (displayed in green), as in the case of "EVENTS", "EDUCATION", and "BOOKS_AND_REFERENCE". This phenomenon could indicate that in less crowded markets, it's easier to emerge with quality products that more readily meet user expectations. Conversely, highly competitive categories like "DATING" and "CASINO" show lower average ratings (in orange-red), suggesting greater difficulty in standing out and fully meeting user expectations in typically saturated markets.

 

Price distribution of paid apps¶

The second chart explores direct monetization strategies through the price distribution of paid apps. The market shows a clear preference for pricing in the $1-2.99 range, which encompasses 36.4% of paid apps. This data suggests a balance point between accessibility for users and perceived value by developers.

It's interesting to note how the distribution doesn't follow a linear decreasing trend: the $0-1 range (20.2%) is more populated than the $3-4.99 range (19.9%), while there's a sharp decline in the $5-9.99 range (11.3%) before a slight rise in the premium category over $10 (12.1%). This pattern reflects different monetization and positioning strategies: many developers opt for very low prices aiming for volume, while others choose premium positioning with high prices targeting users willing to pay for exclusive or very specific features.

The coloring of the bars in the chart offers an additional analytical dimension. Examining the color gradients, it emerges that applications in the lower price ranges present higher average ratings, suggesting that price point and user satisfaction align well at these levels: consumer expectations at lower prices are generally met by the user experience. Conversely, applications positioned in the premium range (over $10) show lower average ratings, which could indicate that users are more critical when paying premium prices and their expectations are harder to meet.

 

Competitive market map¶

The competitive map represents a sophisticated analytical tool that offers a multidimensional view of the app market. In this scatter plot, each category is positioned based on two fundamental metrics: the number of applications (X-axis) which indicates the level of competition, and the average rating (Y-axis) which reflects user satisfaction.

The competitive landscape appears stratified, with "FAMILY" dominating in quantitative terms with about 2000 apps (extreme right of the chart), followed by "GAME" with about 1000 apps and "TOOLS" with about 750 apps, as was already evident from the chart "Distribution of Apps by Category and Average Rating". The size of the circles, proportional to the market size index, visually amplifies this hierarchy, highlighting the significant weight of these categories in the overall ecosystem.

Particularly interesting is the vertical distribution: categories like "EVENTS", "EDUCATION", and "ART_AND_DESIGN" are located in the upper part of the chart with average ratings above 4.3, suggesting a high perceived quality. At the opposite extreme, categories like "DATING" show more modest ratings, indicating greater difficulty in meeting user expectations.

The analysis is significantly enriched by the information box in the bottom right of the chart, which reveals the top 5 market opportunities according to the opportunity score:

  1. MEDICAL emerges as the most promising opportunity with a score of 8.67, combining a good rating (4.2), moderate competition (439 apps), and a high propensity to pay (22.6% of paid apps);
  2. PERSONALIZATION positions itself immediately after with 8.63, thanks to a slightly higher rating (4.3) and a less crowded market (352 apps), maintaining a high percentage of paid apps (22.2%);
  3. BOOKS_AND_REFERENCE presents a score of 6.06, with an excellent rating (4.3) and low competition (200 apps), although it has a lower propensity to pay (13.5%);
  4. WEATHER represents an interesting niche with a score of 5.15, combining a good rating (4.2) with minimal competition (only 57 apps) and a decent propensity to pay (10.5%);
  5. TOOLS, despite high competition (744 apps), maintains a respectable score of 4.57 thanks to a decent rating (4.0) and the presence of a good segment of users willing to pay (9.3%).

This ranking reveals how the best opportunities are not necessarily the categories located in the upper left corner of the chart (high rating, low competition). The propensity to pay plays an important role, making categories like MEDICAL and PERSONALIZATION particularly attractive despite not being the least competitive or those with the highest ratings in absolute terms.

In [ ]:
logger = logging.getLogger(__name__)

@dataclass
class CategoryStats:
    num_apps: int
    avg_rating: float
    paid_perc: float
    avg_installs: float
    avg_size: float

@dataclass
class MarketAnalysis:
    category_stats: pd.DataFrame
    price_stats: Dict[str, float]
    market_analysis: pd.DataFrame
    figures: List[go.Figure] = field(default_factory=list)

class MarketAnalyzer:

    def __init__(self, apps_df: pd.DataFrame, max_workers: int = 4):
        self.apps_df = apps_df[apps_df['Category'] != '1.9'].copy()
        self.max_workers = max_workers

    @lru_cache(maxsize=None)
    def _calculate_category_stats(self, category: str) -> CategoryStats:
        cat_data = self.apps_df[self.apps_df['Category'] == category]
        return CategoryStats(
            num_apps=len(cat_data),
            avg_rating=cat_data['Rating'].mean(),
            paid_perc=(cat_data['Price_Clean'] > 0).mean() * 100,
            avg_installs=cat_data['Installs_Clean'].mean(),
            avg_size=cat_data['Size_MB'].mean()
        )

    def analyze_category_distribution(self) -> Tuple[pd.DataFrame, go.Figure]:
        # Parallel calculation of statistics by category
        categories = self.apps_df['Category'].unique()
        with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
            stats = list(executor.map(self._calculate_category_stats, categories))

        # DataFrame creation
        category_stats = pd.DataFrame({
            'Category': categories,
            'num_apps': [s.num_apps for s in stats],
            'avg_rating': [s.avg_rating for s in stats],
            'paid_perc': [s.paid_perc for s in stats],
            'avg_installs': [s.avg_installs for s in stats],
            'avg_size': [s.avg_size for s in stats]
        }).round(2)

        # Optimized chart creation
        fig = px.bar(
            category_stats,
            x='Category',
            y='num_apps',
            color='avg_rating',
            title='Distribution of apps by category and average rating',
            labels={
                'num_apps': 'Number of apps',
                'Category': 'Category',
                'avg_rating': 'Average rating'
            },
            color_continuous_scale=[[0, '#B30000'], [0.4, '#FF0000'],
                                  [0.6, '#FFA500'], [0.75, '#2ECC40'],
                                  [1, '#00B300']],
            range_color=[3.2, 4.8]
        )

        fig.update_layout(
            xaxis_tickangle=-45,
            showlegend=True,
            height=600,
            title_x=0.5,
            font=dict(family="Arial", size=12),
            margin=dict(t=100, l=50, r=50, b=100)
        )

        return category_stats.set_index('Category'), fig

    def analyze_price_distribution(self) -> Tuple[Dict[str, float], go.Figure]:
        df_paid = self.apps_df[self.apps_df['Price_Clean'] > 0].copy()

        price_ranges = [0, 1, 2.99, 4.99, 9.99, float('inf')]
        price_labels = ['0-1$', '1-2.99$', '3-4.99$', '5-9.99$', '10$+']

        df_paid['price_range'] = pd.cut(df_paid['Price_Clean'],
                                      bins=price_ranges,
                                      labels=price_labels)

        # Calculate price statistics
        price_stats = {
            'total_apps': len(self.apps_df),
            'paid_apps': len(df_paid),
            'paid_percentage': (len(df_paid) / len(self.apps_df)) * 100,
            'avg_price': df_paid['Price_Clean'].mean(),
            'median_price': df_paid['Price_Clean'].median(),
            'max_price': df_paid['Price_Clean'].max()
        }

        # Data aggregation for the chart
        price_distribution = df_paid.groupby('price_range').agg({
            'App': 'count',
            'Rating': 'mean'
        }).reset_index()

        price_distribution['percentage'] = (
            price_distribution['App'] / len(df_paid)
        ) * 100

        # Chart creation
        fig = go.Figure(data=[
            go.Bar(
                x=price_distribution['price_range'],
                y=price_distribution['percentage'],
                marker=dict(
                    color=price_distribution['Rating'],
                    colorscale=[[0, '#B30000'], [0.4, '#FF0000'],
                               [0.6, '#FFA500'], [0.75, '#2ECC40'],
                               [1, '#00B300']],
                    colorbar=dict(
                        title="Average rating",
                        titleside="right",
                        xpad=30,
                        len=0.9,
                        thickness=20
                    ),
                    cmin=3.2,
                    cmax=4.8
                ),
                text=price_distribution['percentage'].round(1).astype(str) + '%',
                textposition='outside'
            )
        ])

        fig.update_layout(
            title='Price distribution of paid apps',
            title_x=0.5,
            xaxis_title='Price range',
            yaxis_title='Percentage of apps (%)',
            height=500,
            yaxis_range=[0, max(price_distribution['percentage']) * 1.1],
            bargap=0.2,
            font=dict(family="Arial", size=12)
        )

        return price_stats, fig

    def analyze_market_competition(self) -> Tuple[pd.DataFrame, go.Figure]:
        """Analyzes market competitiveness"""
        market_analysis = self.apps_df.groupby('Category').agg({
            'App': 'count',
            'Rating': 'mean',
            'Price_Clean': lambda x: (x > 0).mean() * 100,
            'Installs_Clean': 'mean'
        }).round(2)

        market_analysis.columns = ['num_apps', 'avg_rating', 'paid_perc', 'avg_installs']

        # Calculate market size index
        market_analysis['market_size_index'] = (
            (market_analysis['num_apps'] - market_analysis['num_apps'].min()) /
            (market_analysis['num_apps'].max() - market_analysis['num_apps'].min())
        )

        # Chart creation
        fig = go.Figure(data=[
            go.Scatter(
                x=market_analysis['num_apps'],
                y=market_analysis['avg_rating'],
                mode='markers+text',
                text=market_analysis.index,
                textposition='top center',
                marker=dict(
                    size=market_analysis['market_size_index'] * 50,
                    color=market_analysis['avg_rating'],
                    colorscale=[[0, '#B30000'], [0.4, '#FF0000'],
                               [0.6, '#FFA500'], [0.75, '#2ECC40'],
                               [1, '#00B300']],
                    colorbar=dict(
                        title="Average rating",
                        titleside="right",
                        xpad=30,
                        len=0.9,
                        thickness=20
                    ),
                    cmin=3.2,
                    cmax=4.8
                )
            )
        ])

        fig.update_layout(
            title='Competitive market map',
            title_x=0.5,
            xaxis_title='Number of apps (competition)',
            yaxis_title='Average rating',
            height=600,
            showlegend=False,
            font=dict(family="Arial", size=12)
        )

        # Calculate opportunity score
        market_analysis['opportunity_score'] = (
            market_analysis['avg_rating'] * 0.4 +
            (1 - market_analysis['market_size_index']) * 0.3 +
            market_analysis['paid_perc'] * 0.3
        )

        # Add annotation with best opportunities
        top_opportunities = market_analysis.nlargest(5, 'opportunity_score')
        top_text = "<b>Top 5 market opportunities:</b><br>"
        for cat in top_opportunities.index:
            score = market_analysis.loc[cat, 'opportunity_score']
            rating = market_analysis.loc[cat, 'avg_rating']
            apps = market_analysis.loc[cat, 'num_apps']
            paid = market_analysis.loc[cat, 'paid_perc']
            top_text += f"<b>{cat}</b>: score {score:.2f} (rating {rating:.1f}, apps {apps}, propensity {paid:.1f}%)<br>"

        fig.add_annotation(
            x=0.99,
            y=0.01,
            xref="paper",
            yref="paper",
            xanchor="right",
            yanchor="bottom",
            text=top_text,
            showarrow=False,
            font=dict(size=11),
            align="left",
            bgcolor="rgba(255, 255, 255, 0.9)",
            bordercolor="black",
            borderwidth=1,
            borderpad=6
        )

        return market_analysis, fig

def perform_exploratory_analysis(apps_df: pd.DataFrame,
                               reviews_df: Optional[pd.DataFrame] = None,
                               analyzer: Optional[MarketAnalyzer] = None) -> MarketAnalysis:
    logger.info("=== BEGINNING EXPLORATORY ANALYSIS ===")

    analyzer = analyzer or MarketAnalyzer(apps_df)
    figures = []

    try:
        # 1. Category distribution analysis
        logger.info("\n1. Category distribution analysis")
        category_stats, category_fig = analyzer.analyze_category_distribution()
        figures.append(category_fig)

        # 2. Price analysis
        logger.info("\n2. Price distribution analysis")
        price_stats, price_fig = analyzer.analyze_price_distribution()
        figures.append(price_fig)

        # 3. Competitive analysis
        logger.info("\n3. Competitive market analysis")
        market_analysis, market_fig = analyzer.analyze_market_competition()
        figures.append(market_fig)

        # Log main results
        logger.info("\nTop 5 market opportunities:")
        top_opportunities = market_analysis.nlargest(5, 'opportunity_score')
        for cat in top_opportunities.index:
            logger.info(f"\n{cat}:")
            logger.info(f"- Score: {market_analysis.loc[cat, 'opportunity_score']:.2f}")
            logger.info(f"- Average rating: {market_analysis.loc[cat, 'avg_rating']:.2f}")
            logger.info(f"- Competition: {market_analysis.loc[cat, 'num_apps']:,} apps")
            logger.info(f"- % paid apps: {market_analysis.loc[cat, 'paid_perc']:.1f}%")

        return MarketAnalysis(
            category_stats=category_stats,
            price_stats=price_stats,
            market_analysis=market_analysis,
            figures=figures
        )

    except Exception as e:
        logger.error(f"Error during exploratory analysis: {str(e)}")
        raise

# Execute exploratory analysis
market_analysis = perform_exploratory_analysis(apps_clean, reviews_clean)

# Display charts
for fig in market_analysis.figures:
    fig.show()

5. Performance analysis and key metrics¶

The fifth cell transitions from an initial descriptive exploration to a more in-depth investigation of relationships between variables and temporal trends. While previous blocks identified market opportunities based on categories, we now explore how different metrics influence each other and how the market has evolved over time.

Two NamedTuple classes are defined as typed containers for results: CorrelationResults collects the Pearson and Spearman correlation matrices along with their visualizations, while TimeMetrics contains aggregated temporal data and the corresponding chart.

The main class PlayStoreAnalyzer is implemented as @dataclass and in its __post_init__ method performs a necessary filtering operation:

self.apps_df = self.apps_df[self.apps_df['Category'] != '1.9'].copy()

This line removes from the dataset a clearly anomalous observation where the "Category" column contains the value '1.9', which does not correspond to any legitimate Google Play Store category. Upon closer examination of the dataset, this record shows data misalignment, with values shifted between columns. This row contains impossible values such as a rating of "19", the string "Free" in the installations column, and other clearly misplaced elements. I decided to completely remove this data that would compromise the reliability of subsequent analyses.

After this initial cleanup, the prepare_data() method transforms raw data into meaningful analytical metrics through several essential operations:

  1. processes temporal information by converting the "Last Updated" column to datetime format, calculating the number of days since the last update relative to the most recent date in the dataset, and extracting the update year into a new column;

  2. applies logarithmic transformations (np.log1p()) to the installations and size columns. This technique is particularly useful for handling asymmetric distributions or those with wide ranges of values, allowing better visualization and analysis of relationships that would otherwise be difficult to interpret.

The code also calculates market metrics such as global market share (dividing each app's installations by total installations) and share within the category, using pandas' transform function which maintains the original dimensionality of the dataframe.
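A minimal sketch of these transformations, applied to a small illustrative dataframe (the column names follow the text; the values are invented):

```python
import numpy as np
import pandas as pd

# Toy frame with the columns described in the text (illustrative values only)
df = pd.DataFrame({
    'Category': ['TOOLS', 'TOOLS', 'GAME'],
    'Last Updated': ['January 1, 2018', 'June 15, 2018', 'March 3, 2017'],
    'Installs_Clean': [1_000, 100_000, 10_000],
    'Size_MB': [5.0, 50.0, 20.0],
})

# 1. Temporal features: datetime conversion, days since the most recent
#    update in the dataset, and the update year
df['Last Updated'] = pd.to_datetime(df['Last Updated'])
df['days_since_update'] = (df['Last Updated'].max() - df['Last Updated']).dt.days
df['update_year'] = df['Last Updated'].dt.year

# 2. Log transforms to tame the heavy right tails of installs and size
df['log_installs'] = np.log1p(df['Installs_Clean'])
df['log_size'] = np.log1p(df['Size_MB'])

# Market shares: global, and within each category via transform,
# which preserves the original row count of the dataframe
df['market_share'] = df['Installs_Clean'] / df['Installs_Clean'].sum()
df['category_share'] = (
    df['Installs_Clean'] / df.groupby('Category')['Installs_Clean'].transform('sum')
)
```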

Additionally, the method provides for integrating sentiment data into the main dataset through _merge_sentiment_data(). This process occurs in three steps: first it joins the review data with app categories via an inner join, then aggregates the data to calculate the average polarity (positivity/negativity) and average subjectivity of reviews for each app, and finally joins this aggregated data to the main dataframe with a left join.
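The three-step merge can be sketched on toy data (app names and sentiment values here are invented for illustration):

```python
import pandas as pd

# Hypothetical review-level sentiment data
reviews = pd.DataFrame({
    'App': ['A', 'A', 'B'],
    'Sentiment_Polarity': [0.5, 0.1, -0.2],
    'Sentiment_Subjectivity': [0.6, 0.4, 0.3],
})
apps = pd.DataFrame({'App': ['A', 'B', 'C'], 'Category': ['GAME', 'TOOLS', 'GAME']})

# Step 1: attach categories to reviews (inner join drops unmatched reviews)
merged = reviews.merge(apps[['App', 'Category']], on='App', how='inner')

# Step 2: average polarity and subjectivity per app
sentiment = (
    merged.groupby('App')[['Sentiment_Polarity', 'Sentiment_Subjectivity']]
    .mean()
    .reset_index()
)

# Step 3: left join back onto the apps frame (apps with no reviews get NaN)
apps = apps.merge(sentiment, on='App', how='left')
```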

I want to specify again that sentiment metrics (Sentiment_Polarity and Sentiment_Subjectivity) are not actually used in the visualizations and correlation analyses for the reasons expressed previously.

 

The analyze_correlations() method calculates two types of correlation coefficients that provide complementary perspectives:

  • Pearson correlation (r = Σ[(x_i - x̄)(y_i - ȳ)] / √[Σ(x_i - x̄)² · Σ(y_i - ȳ)²], where x_i and y_i are individual observations and x̄ and ȳ are the variable means) measures linear relationships between variables, quantifying how much two variables tend to increase or decrease together proportionally. It is particularly effective for linear relationships, but can be misleading with non-linear relationships or significant outliers;

  • Spearman correlation (ρ = 1 - (6 · Σd_i²) / [n(n² - 1)], where d_i is the difference between the ranks of corresponding observations and n is the number of observations) captures monotonic relationships - that is, when one variable increases, the other tends to always change in the same direction - even when they are not strictly linear. Based on the ranks of variables rather than their absolute values, it is more robust against outliers and non-normal distributions, providing a more complete view when analyzing data that often present asymmetric distributions.

The calculation of correlations is performed using pandas' built-in functionalities:

data = self.apps_df[metrics.keys()]
pearson_corr = data.corr(method='pearson').round(3)
spearman_corr = data.corr(method='spearman').round(3)

The round(3) method rounds the correlation coefficients to three decimal places to improve readability.

Using both metrics offers a potentially more complete view of the relationships between variables, allowing identification of both linear and non-linear patterns in the data.

To visualize these correlations, the code creates two main representations. The first is a heatmap that compares the Pearson and Spearman correlation matrices side by side, using a color scale from red (negative correlations) to blue (positive correlations) to visually highlight the strength and direction of relationships.

The second is a scatter plot matrix that shows the relationships between specific pairs of variables. In the scatter plot analyzing the relationship between rating and days since last update, the code adds slight random noise to the rating values ("jitter") to avoid point overlaps, then calculates a moving average to highlight the general trend, and limits the y-axis to the 95th percentile to avoid extreme values compressing the visualization. In this second visualization, the point colors are based on the apps' ratings, with a scale ranging from red (low ratings) to green (high ratings), allowing easy identification of patterns in the relationships between variables.
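The jitter, moving-average, and percentile-capping steps can be sketched independently of the plotting code, on synthetic data (all values below are invented; the window size and noise scale are assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data: days since update vs rating, with a mild downward trend
days = np.sort(rng.integers(0, 1000, size=200))
rating = np.clip(4.5 - days / 2000 + rng.normal(0, 0.2, size=200), 1, 5)

# Jitter: small random noise so identical ratings don't overlap as points
rating_jitter = rating + rng.normal(0, 0.02, size=rating.size)

# Moving average (here over 30 observations) to expose the general trend
trend = pd.Series(rating, index=days).rolling(window=30, min_periods=1).mean()

# Cap the axis at the 95th percentile so extreme values don't compress the plot
axis_max = np.percentile(days, 95)
```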

 

The analyze_temporal_trends() method offers an evolutionary perspective of the app market, aggregating data by update year to understand how key metrics have changed over time. The analysis begins by calculating the average price of paid apps for each year, excluding free apps to avoid distorting the results. The data is then aggregated by year, calculating the average rating, average size, average installations, and app count for each period.

The visualization of temporal evolution adopts a two-subplot approach that organizes metrics into conceptually coherent groups:

  • the upper subplot presents product metrics, intrinsic characteristics of apps: average rating (reflecting user-perceived quality), average price (indicating monetization strategies), and average size (which may correlate with feature complexity and richness). These three parameters are visualized with different colored lines for distinction: green for rating, red for price, and blue for size;

  • the lower subplot visualizes market metrics, indicators of app performance and diffusion: average installations (represented by an orange line showing the average popularity of apps) and total number of apps (displayed as semi-transparent gray bars, providing context on market size evolution).

For the lower subplot, a logarithmic scale is implemented on the Y-axis, transforming exponential increments into linear increments: for example, the transitions from 1000 to 10000, from 10000 to 100000, and from 100000 to 1000000 appear as equal distances in the chart. This approach allows simultaneous visualization of apps with very different popularities and appreciation of proportional rather than absolute changes, revealing relative growth patterns that would otherwise remain hidden in a traditional linear scale.

Finally, the _format_trend_value() method customizes the format based on the metric type: it transforms large installation numbers into more readable formats (K for thousands, M for millions), formats prices with the dollar symbol, adds "MB" to sizes, and appropriately handles missing values with "N/A" notation.
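The _format_trend_value implementation itself is not reproduced in this section; the following is a plausible reconstruction based solely on the behavior described above (the metric keys and thresholds are assumptions):

```python
import math

def format_trend_value(value, metric):
    """Format a metric for display: K/M for installs, $ for prices,
    MB for sizes, and "N/A" for missing values (hypothetical helper)."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return "N/A"
    if metric == 'installs':
        if value >= 1_000_000:
            return f"{value / 1_000_000:.1f}M"
        if value >= 1_000:
            return f"{value / 1_000:.1f}K"
        return f"{value:.0f}"
    if metric == 'price':
        return f"${value:.2f}"
    if metric == 'size':
        return f"{value:.1f} MB"
    return f"{value:.2f}"

print(format_trend_value(2_500_000, 'installs'))  # -> 2.5M
```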

 

The analyze_play_store() function coordinates the analysis process. Its implementation follows several phases:

  1. the function creates a PlayStoreAnalyzer instance, passing the app and review dataframes. This instance contains all the specialized logic for the different analyses. It also prepares two empty data structures: a figures list to collect the generated visualizations and a results dictionary to store the numerical results.

  2. it performs the correlation analysis by calling the analyze_correlations() method within a try-except block for error handling. The results are stored in the results dictionary under the key 'correlations' and the generated figures are added to the figures list. An important feature is the identification and recording of noteworthy correlations: the function iterates through the Pearson and Spearman correlation matrices, extracting and logging only relationships with correlation coefficients greater than 0.3 in absolute value.

  3. it then proceeds with temporal trend analysis by calling analyze_temporal_trends(). Again, the results and visualizations are stored. A notable element is the detailed calculation of changes between the first and last available years in the dataset: for average rating, the percentage change is calculated, while for other metrics such as average size, average installations, and number of apps, absolute values at the beginning and end of the analyzed period are recorded.

  4. all results and figures are collected in the results dictionary, which is returned as the function output. The try-except block wrapping the entire implementation provides robustness to the analysis. When an exception occurs, the code catches the error by entering the except block, records it in the logging system through logger.error(), and re-raises it with the raise instruction without parameters.

This approach:

  • ensures that no error goes unnoticed thanks to logging;

  • maintains the complete trace of the original error (type, message, and stack trace) and allows the calling code to implement any recovery strategies.

Unlike handling that would "swallow" the error, this technique ensures that problems in the data or analysis process are correctly identified and can be addressed appropriately.
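The pattern described above can be reduced to a few lines (a generic sketch, not tied to the analyzer's actual methods):

```python
import logging

logging.basicConfig(level=logging.ERROR)
logger = logging.getLogger(__name__)

def run_analysis(step):
    """Run an analysis step, logging failures without swallowing them."""
    try:
        return step()
    except Exception as e:
        # Record the problem for diagnostics...
        logger.error(f"Error during analysis: {e}")
        # ...then re-raise unchanged: a bare `raise` preserves the
        # original exception type, message, and stack trace
        raise

# The caller still sees the original exception and can recover:
try:
    run_analysis(lambda: 1 / 0)
except ZeroDivisionError:
    pass  # recovery strategy goes here
```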

 

Interpretation of Results¶

 

Correlation Comparison (Pearson vs Spearman)¶

Analyzing the matrices, several interesting correlations emerge:

 

Days since last update vs Log size

Pearson: -0.35 / Spearman: -0.33

This negative correlation indicates that more recently updated apps tend to have larger sizes. It could reflect a tendency of developers to release updates that enrich the app with new features, consequently increasing its size.

 

Log size vs Log installations

Pearson: 0.34 / Spearman: 0.35

This positive correlation suggests that larger apps tend to have more installations. This could indicate that users are willing to download heavier apps when they offer more features or content, or that more successful apps tend to expand over time.

 

Days since last update vs Log installations

Pearson: -0.19 / Spearman: -0.33

It is interesting that this correlation is stronger under Spearman than under Pearson, suggesting a monotonic but not perfectly linear relationship. Apps updated more frequently tend to have more installations, presumably because regular updates keep the app relevant and attractive to users.
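The gap between the two coefficients is exactly what one expects for a monotonic but non-linear relationship. A small synthetic illustration (pandas only, not the project's data):

```python
import numpy as np
import pandas as pd

x = pd.Series(np.arange(1, 101, dtype=float))
y = np.exp(x / 20)  # strictly increasing, but strongly curved

pearson = x.corr(y, method='pearson')
spearman = x.corr(y, method='spearman')

# Spearman sees a perfect monotonic relationship (1.0),
# while Pearson is pulled down by the curvature
print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

Because Spearman works on ranks, any strictly increasing relationship scores exactly 1.0, no matter how non-linear it is.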

 

Price vs other metrics

Both matrices show weak correlations between price and other metrics, with values rarely exceeding ±0.20. This suggests that price has a limited relationship with other parameters, which could indicate that pricing strategies are determined by factors other than popularity or technical characteristics of apps.

 

Rating vs other metrics

Rating also shows generally weak correlations with other metrics, with the exception of a slight negative correlation with days since last update (-0.13 in Pearson, -0.19 in Spearman), suggesting that more recently updated apps tend to have slightly better ratings.

 

In summary, we can deduce that:

  1. update frequency seems to be an important factor correlated with app success, suggesting the importance of regular maintenance;

  2. app size has a positive correlation with installations, indicating that users might prefer apps richer in features;

  3. price and rating seem to be influenced by more complex factors not directly captured by the other metrics analyzed.

It is important to note, however, that while the correlations are not extremely strong, they still provide useful indications about relationships between app characteristics.

 

Scatter plots of key relationships¶

These scatter plots greatly enrich the understanding of the correlations analyzed previously:

  1. they visually confirm the nature of relationships, showing not only their strength, but also their shape;

  2. they highlight non-linear patterns, particularly visible in the Rating vs Days since last update graph;

  3. they allow identification of clusters and distributions that simple correlations do not capture.

The strategic implications that emerge from this visual analysis reinforce what has already been observed:

  • regular updates seem to be fundamental to maintaining high ratings and, potentially, to increasing installations;

  • the positive relationship between app size and their diffusion suggests that users appreciate apps richer in features;

  • there is an evident interconnection between high rating and greater number of installations, which may indicate how perceived quality can influence the popularity of an app.

 

Temporal evolution of key metrics¶

The chart offers a dynamic perspective on the app market from 2010 to 2018, divided into two panels: the upper one shows product metrics (rating, price, size) and the lower one market metrics (installations and number of apps).

In the upper panel we observe some trends in app characteristics:

  • Average size (blue line): shows the most significant growth, going from values close to zero in 2010 to about 25 MB in 2018. This constant and substantial increase reflects the evolution of mobile device hardware capabilities and the growing complexity of modern apps, which incorporate increasingly advanced features, higher quality graphical elements, and multimedia content.

  • Average price (red line): presents a more irregular trend with a notable surge between 2016 and 2017 (from about $5 to $22), followed by a slight decrease in 2018. This peak could indicate a change in monetization strategies or the entry into the market of premium apps in specific categories. The subsequent decrease suggests a possible market adjustment towards more competitive prices.

  • Average rating (green line): remains remarkably stable around the value 4 during the entire period, with minimal variations. This stability is interesting considering the significant changes in other metrics and suggests that, despite market evolution, user expectations regarding quality and developers' ability to meet them have remained relatively constant.

 

In the lower panel, displayed on a logarithmic scale:

  • Average installations (orange line): show a gradual increase in the analyzed period, with a more marked acceleration in recent years, reaching over a million average installations per app in 2018. This trend suggests increasing smartphone penetration and greater user engagement with apps.

  • Number of apps (gray bars): highlights exponential growth, reaching several thousand apps in 2018. This growth testifies to the explosion of the app ecosystem and the intensification of competition. It is important to note that the logarithmic scale visually attenuates this growth, which is much more pronounced than it appears.

 

Analyzing the two panels jointly, interesting relationships emerge:

  1. increasing size and complexity: the constant increase in the average size of apps has occurred in parallel with the growth of average installations, suggesting that users have not been discouraged by heavier apps, probably because they offer richer and more satisfying experiences and greater functionality;

  2. stability of perceived quality: despite the increase in complexity and size of apps, the average rating has remained stable;

  3. price dynamics: the significant increase in prices between 2016 and 2017, followed by a slight decrease, could reflect more aggressive monetization attempts in a mature market, followed by competitive adjustments;

  4. market saturation: the exponential growth in the number of apps, combined with the more moderate increase in average installations, suggests increasing competition for user attention.

 

The analyses carried out in this code fragment and the related visualizations offer valuable indications for the launch of a hypothetical new app:

  • users seem to accept larger apps, provided they offer proportional value;

  • the market has become extremely competitive, with thousands of apps competing for user attention;

  • the stability of ratings suggests that quality expectations are well established;

  • pricing strategies require particular attention, considering the significant changes observed in recent years.

In [ ]:
logger = logging.getLogger(__name__)

class CorrelationResults(NamedTuple):
    pearson: pd.DataFrame
    spearman: pd.DataFrame
    figures: List[go.Figure]

class TimeMetrics(NamedTuple):
    data: pd.DataFrame
    figure: go.Figure

@dataclass
class PlayStoreAnalyzer:

    apps_df: pd.DataFrame
    reviews_df: Optional[pd.DataFrame] = None
    max_workers: int = 4

    def __post_init__(self):
        self.apps_df = self.apps_df[self.apps_df['Category'] != '1.9'].copy()
        self.prepare_data()

    def prepare_data(self) -> None:

        # Temporal metrics
        self.apps_df['Last Updated'] = pd.to_datetime(self.apps_df['Last Updated'])

        # Use the most recent date in the dataset as reference
        max_date = self.apps_df['Last Updated'].max()
        self.apps_df['Days_Since_Update'] = (
            max_date - self.apps_df['Last Updated']
        ).dt.days

        self.apps_df['Update_Year'] = self.apps_df['Last Updated'].dt.year

        # Logarithmic transformations
        self.apps_df['Log_Installs'] = np.log1p(self.apps_df['Installs_Clean'])
        self.apps_df['Log_Size'] = np.log1p(self.apps_df['Size_MB'])

        # Market metrics
        total_installs = self.apps_df['Installs_Clean'].sum()
        self.apps_df['market_share'] = self.apps_df['Installs_Clean'] / total_installs
        self.apps_df['category_share'] = self.apps_df.groupby('Category')['Installs_Clean'].transform(
            lambda x: x / x.sum()
        )

        # Merge with sentiment if available
        if self.reviews_df is not None:
            self._merge_sentiment_data()

    def _merge_sentiment_data(self) -> None:
        sentiment_data = self.reviews_df.merge(
            self.apps_df[['App', 'Category']],
            on='App',
            how='inner'
        )

        app_sentiment = sentiment_data.groupby(['App', 'Category']).agg({
            'Sentiment_Polarity': 'mean',
            'Sentiment_Subjectivity': 'mean'
        }).reset_index()

        self.apps_df = self.apps_df.merge(
            app_sentiment,
            on=['App', 'Category'],
            how='left'
        )

    @staticmethod
    def _create_correlation_heatmap(corr_matrix: pd.DataFrame,
                                  title: str) -> go.Figure:
        return go.Figure(
            data=go.Heatmap(
                z=corr_matrix.values,
                x=corr_matrix.columns,
                y=corr_matrix.index,
                colorscale='RdBu',
                zmin=-1,
                zmax=1,
                text=corr_matrix.values.round(2),
                texttemplate='%{text}',
                textfont={"size": 10}
            ),
            layout=dict(
                title=title,
                height=600,
                font=dict(family="Arial", size=12)
            )
        )

    def _create_scatter_matrix(self, metrics: Dict[str, str]) -> go.Figure:
        scatter_pairs = [
            ('Rating', 'Log_Installs'),
            ('Rating', 'Log_Size'),
            ('Log_Size', 'Log_Installs'),
            ('Rating', 'Days_Since_Update')
        ]

        fig = make_subplots(
            rows=2, cols=2,
            subplot_titles=[
                f'{metrics[x]} vs {metrics[y]}'
                for x, y in scatter_pairs
            ]
        )

        # Creating a color scale for days
        colorscale = [
            [0, 'green'],      # 0-30 days
            [0.1, 'lightgreen'],  # 30-90 days
            [0.2, 'yellow'],   # 90-180 days
            [0.4, 'orange'],   # 6 months-1 year
            [0.6, 'red'],      # 1-2 years
            [1.0, 'darkred']   # >2 years
        ]

        for idx, (x, y) in enumerate(scatter_pairs):
            row = idx // 2 + 1
            col = idx % 2 + 1

            hover_text = [
                f"App: {app}<br>" +
                f"Category: {cat}<br>" +
                f"{metrics[x]}: {val_x:.2f}<br>" +
                f"{metrics[y]}: {val_y:.2f}<br>" +
                f"Price: ${price:.2f}<br>" +
                f"Installations: {inst:,.0f}<br>" +
                f"Size: {size:.1f}MB<br>" +
                f"Days since last update: {days:.0f}"
                for app, cat, val_x, val_y, price, inst, size, days in zip(
                    self.apps_df['App'],
                    self.apps_df['Category'],
                    self.apps_df[x],
                    self.apps_df[y],
                    self.apps_df['Price_Clean'],
                    self.apps_df['Installs_Clean'],
                    self.apps_df['Size_MB'],
                    self.apps_df['Days_Since_Update']
                )
            ]

            fig.add_trace(
                go.Scatter(
                    x=self.apps_df[x],
                    y=self.apps_df[y],
                    mode='markers',
                    marker=dict(
                        size=4,
                        opacity=0.6,
                        color=self.apps_df['Days_Since_Update'],
                        colorscale=colorscale,
                        colorbar=dict(
                            title='Days since<br>last update',
                            ticktext=['0', '30', '90', '180', '365', '730', '>730'],
                            tickvals=[0, 30, 90, 180, 365, 730, 1000]
                        ) if idx == 1 else None,
                        cmin=0,
                        cmax=1000
                    ),
                    name=f'{metrics[x]} vs {metrics[y]}',
                    hovertemplate="%{text}<extra></extra>",
                    text=hover_text,
                    showlegend=False
                ),
                row=row, col=col
            )

            # Updating axes
            fig.update_xaxes(title=metrics[x], row=row, col=col, gridcolor='lightgray', showgrid=True)
            fig.update_yaxes(title=metrics[y], row=row, col=col, gridcolor='lightgray', showgrid=True)

        fig.update_layout(
            title='Scatter plots of main relationships',
            height=800,
            width=1000,
            showlegend=False,
            title_x=0.5,
            hovermode='closest',
            plot_bgcolor='white',
            margin=dict(t=100, l=50, r=50, b=50)
        )

        return fig

    def analyze_correlations(self) -> Tuple[pd.DataFrame, pd.DataFrame, List[go.Figure]]:
        metrics = {
            'Rating': 'Rating',
            'Price_Clean': 'Price',
            'Log_Installs': 'Log installations',
            'Log_Size': 'Log size',
            'Days_Since_Update': 'Days since last update'
        }

        # Calculate correlations on the selected metrics
        data = self.apps_df[list(metrics.keys())]
        pearson_corr = data.corr(method='pearson').round(3)
        spearman_corr = data.corr(method='spearman').round(3)

        # Rename rows and columns with readable labels
        for corr_matrix in [pearson_corr, spearman_corr]:
            corr_matrix.columns = list(metrics.values())
            corr_matrix.index = list(metrics.values())

        # Create graphs
        figures = []

        # Correlation heatmap
        heatmap_fig = make_subplots(
            rows=1, cols=2,
            subplot_titles=('Pearson Correlations', 'Spearman Correlations'),
            horizontal_spacing=0.15
        )

        # Add Pearson heatmap
        heatmap_fig.add_trace(
            go.Heatmap(
                z=pearson_corr.values,
                x=pearson_corr.columns,
                y=pearson_corr.index,
                colorscale='RdBu',
                zmin=-1,
                zmax=1,
                text=pearson_corr.values.round(2),
                texttemplate='%{text}',
                textfont={"size": 10}
            ),
            row=1, col=1
        )

        # Add Spearman heatmap
        heatmap_fig.add_trace(
            go.Heatmap(
                z=spearman_corr.values,
                x=spearman_corr.columns,
                y=spearman_corr.index,
                colorscale='RdBu',
                zmin=-1,
                zmax=1,
                text=spearman_corr.values.round(2),
                texttemplate='%{text}',
                textfont={"size": 10}
            ),
            row=1, col=2
        )

        heatmap_fig.update_layout(
            title='Pearson vs Spearman correlations comparison',
            title_x=0.5,
            height=600,
            width=1500,
            font=dict(family="Arial", size=12),
            margin=dict(t=100, l=100, r=100, b=50)
        )

        figures.append(heatmap_fig)

        # Scatter matrix
        scatter_pairs = [
            ('Rating', 'Log_Installs'),
            ('Rating', 'Log_Size'),
            ('Log_Size', 'Log_Installs'),
            ('Rating', 'Days_Since_Update')
        ]

        scatter_fig = make_subplots(
            rows=2, cols=2,
            subplot_titles=[
                f'{metrics[x]} vs {metrics[y]}'
                for x, y in scatter_pairs
            ],
            horizontal_spacing=0.15,
            vertical_spacing=0.15
        )

        # Creating a color scale based on rating
        colorscale = [
            [0, 'red'],       # Rating 1
            [0.25, 'orange'], # Rating 2
            [0.5, 'yellow'],  # Rating 3
            [0.75, 'lightgreen'], # Rating 4
            [1, 'green']      # Rating 5
        ]

        for idx, (x, y) in enumerate(scatter_pairs):
            row = idx // 2 + 1
            col = idx % 2 + 1

            hover_text = [
                f"App: {app}<br>" +
                f"Category: {cat}<br>" +
                f"{metrics[x]}: {val_x:.2f}<br>" +
                f"{metrics[y]}: {val_y:.2f}<br>" +
                f"Rating: {rating:.1f}<br>" +
                f"Price: ${price:.2f}<br>" +
                f"Installations: {inst:,.0f}<br>" +
                f"Size: {size:.1f}MB<br>" +
                f"Days since last update: {days:.0f}"
                for app, cat, val_x, val_y, rating, price, inst, size, days in zip(
                    self.apps_df['App'],
                    self.apps_df['Category'],
                    self.apps_df[x],
                    self.apps_df[y],
                    self.apps_df['Rating'],
                    self.apps_df['Price_Clean'],
                    self.apps_df['Installs_Clean'],
                    self.apps_df['Size_MB'],
                    self.apps_df['Days_Since_Update']
                )
            ]

            if y == 'Days_Since_Update':
                # Add jitter to rating to avoid overlapping
                jittered_x = self.apps_df[x] + np.random.normal(0, 0.05, len(self.apps_df))

                # Calculate moving average
                rating_range = np.arange(1, 5.1, 0.1)
                days_mean = []
                for r in rating_range:
                    mask = (self.apps_df[x] >= r - 0.2) & (self.apps_df[x] < r + 0.2)
                    mean_val = self.apps_df.loc[mask, y].mean()
                    days_mean.append(mean_val)

                scatter_fig.add_trace(
                    go.Scatter(
                        x=jittered_x,
                        y=self.apps_df[y],
                        mode='markers',
                        marker=dict(
                            size=3,
                            opacity=0.3,
                            color=self.apps_df['Rating'],
                            colorscale=colorscale,
                            colorbar=dict(
                                title='Rating',
                                ticktext=['1', '2', '3', '4', '5'],
                                tickvals=[1, 2, 3, 4, 5]
                            ) if idx == 1 else None,
                            cmin=1,
                            cmax=5
                        ),
                        name=f'{metrics[x]} vs {metrics[y]}',
                        hovertemplate="%{text}<extra></extra>",
                        text=hover_text,
                        showlegend=False
                    ),
                    row=row, col=col
                )

                # Add moving average line
                scatter_fig.add_trace(
                    go.Scatter(
                        x=rating_range,
                        y=days_mean,
                        mode='lines',
                        line=dict(color='black', width=2),
                        name='Moving average',
                        showlegend=False
                    ),
                    row=row, col=col
                )

                # Update layout for this specific subplot
                scatter_fig.update_xaxes(
                    title=metrics[x],
                    row=row,
                    col=col,
                    gridcolor='lightgray',
                    showgrid=True,
                    range=[0.5, 5.5]
                )

                # Use 95th percentile for y-axis
                y_max = self.apps_df[y].quantile(0.95)
                scatter_fig.update_yaxes(
                    title=metrics[y],
                    row=row,
                    col=col,
                    gridcolor='lightgray',
                    showgrid=True,
                    range=[0, y_max]
                )
            else:
                scatter_fig.add_trace(
                    go.Scatter(
                        x=self.apps_df[x],
                        y=self.apps_df[y],
                        mode='markers',
                        marker=dict(
                            size=4,
                            opacity=0.6,
                            color=self.apps_df['Rating'],
                            colorscale=colorscale,
                            colorbar=dict(
                                title='Rating',
                                ticktext=['1', '2', '3', '4', '5'],
                                tickvals=[1, 2, 3, 4, 5]
                            ) if idx == 1 else None,
                            cmin=1,
                            cmax=5
                        ),
                        name=f'{metrics[x]} vs {metrics[y]}',
                        hovertemplate="%{text}<extra></extra>",
                        text=hover_text,
                        showlegend=False
                    ),
                    row=row, col=col
                )

                scatter_fig.update_xaxes(
                    title=metrics[x],
                    row=row,
                    col=col,
                    gridcolor='lightgray',
                    showgrid=True
                )
                scatter_fig.update_yaxes(
                    title=metrics[y],
                    row=row,
                    col=col,
                    gridcolor='lightgray',
                    showgrid=True
                )

        scatter_fig.update_layout(
            title='Scatter plots of main relationships',
            height=800,
            width=1500,
            showlegend=False,
            title_x=0.5,
            hovermode='closest',
            plot_bgcolor='white',
            margin=dict(t=100, l=50, r=50, b=50)
        )

        figures.append(scatter_fig)

        return pearson_corr, spearman_corr, figures

    def analyze_temporal_trends(self) -> TimeMetrics:
        # Preliminary price check
        avg_price = self.apps_df.groupby('Update_Year').agg({
            'Price_Clean': lambda x: x[x > 0].mean() if len(x[x > 0]) > 0 else np.nan
        })

        # Efficient temporal metrics aggregation
        time_metrics = self.apps_df.groupby('Update_Year').agg({
            'Rating': 'mean',
            'Size_MB': 'mean',
            'Installs_Clean': 'mean',
            'App': 'count'
        }).round(2)

        time_metrics['Price_Clean'] = avg_price['Price_Clean']

        # Optimized figure creation with subplot
        fig = make_subplots(
            rows=2,
            cols=1,
            row_heights=[0.6, 0.4],
            vertical_spacing=0.12,
            subplot_titles=(
                'Product Metrics (Rating, Price, Size)',
                'Market Metrics (Installations and number of apps)'
            )
        )

        # Subplot 1 trace configuration
        traces_subplot1 = [
            ('Rating', 'Average Rating', '#2ECC40'),
            ('Price_Clean', 'Average Price ($)', '#FF4136'),
            ('Size_MB', 'Average Size (MB)', '#0074D9')
        ]

        # Add subplot 1 traces; map column names to the metric types
        # expected by _format_trend_value (col.lower() would produce
        # 'price_clean'/'size_mb', which that method does not recognize)
        metric_types = {'Rating': 'rating', 'Price_Clean': 'price', 'Size_MB': 'size'}
        for col, name, color in traces_subplot1:
            data = time_metrics[col].fillna(0)
            hover_text = [
                f"Year: {year}<br>{name}: {self._format_trend_value(val, metric_types[col])}"
                for year, val in zip(time_metrics.index, time_metrics[col])
            ]

            fig.add_trace(
                go.Scatter(
                    x=time_metrics.index,
                    y=data,
                    name=name,
                    line=dict(color=color, width=2),
                    hovertemplate="%{text}<extra></extra>",
                    text=hover_text
                ),
                row=1,
                col=1
            )

        # Add number of apps bars to subplot 2
        hover_text_app = [
            f"Year: {year}<br>Number of apps: {self._format_trend_value(val, 'app')}"
            for year, val in zip(time_metrics.index, time_metrics['App'])
        ]

        fig.add_trace(
            go.Bar(
                x=time_metrics.index,
                y=time_metrics['App'],
                name='Number of apps',
                marker_color='#AAAAAA',
                opacity=0.3,
                width=0.5,
                hovertemplate="%{text}<extra></extra>",
                text=hover_text_app
            ),
            row=2,
            col=1
        )

        # Add installations line to subplot 2
        hover_text_inst = [
            f"Year: {year}<br>Average Installations: {self._format_trend_value(val, 'installations')}"
            for year, val in zip(time_metrics.index, time_metrics['Installs_Clean'])
        ]

        fig.add_trace(
            go.Scatter(
                x=time_metrics.index,
                y=time_metrics['Installs_Clean'],
                name='Average Installations',
                line=dict(color='#FF851B', width=2),
                hovertemplate="%{text}<extra></extra>",
                text=hover_text_inst
            ),
            row=2,
            col=1
        )

        # Layout optimization
        fig.update_layout(
            title={
                'text': 'Temporal evolution of key metrics',
                'y': 0.98,
                'x': 0.5,
                'xanchor': 'center',
                'yanchor': 'top'
            },
            height=900,
            showlegend=True,
            legend=dict(
                orientation='h',
                yanchor='bottom',
                y=1.05,
                xanchor='center',
                x=0.5,
                bgcolor='rgba(255, 255, 255, 0.8)',
                bordercolor='lightgray',
                borderwidth=1
            ),
            plot_bgcolor='white',
            hovermode='x unified',
            margin=dict(t=120, b=50, l=50, r=50)
        )

        # Axes optimization
        for row in [1, 2]:
            fig.update_xaxes(
                title='Year',
                showgrid=True,
                gridwidth=1,
                gridcolor='lightgray',
                row=row
            )

        fig.update_yaxes(
            title='Value',
            showgrid=True,
            gridwidth=1,
            gridcolor='lightgray',
            row=1,
            col=1
        )

        fig.update_yaxes(
            title='Number',
            showgrid=True,
            gridwidth=1,
            gridcolor='lightgray',
            type='log',
            row=2,
            col=1
        )

        return TimeMetrics(time_metrics, fig)

    def _format_trend_value(self, value: float, metric_type: str) -> str:
        if pd.isna(value):
            return "N/A"

        if metric_type == "installations":
            if value >= 1e6:
                return f"{value/1e6:.1f}M"
            elif value >= 1e3:
                return f"{value/1e3:.1f}K"
            return f"{value:.0f}"
        elif metric_type == "size":
            return f"{value:.1f}MB"
        elif metric_type == "price":
            return f"${value:.2f}"
        elif metric_type == "rating":
            return f"{value:.2f}"
        elif metric_type == "app":
            return f"{int(value):,}"
        return str(value)


# Module-level coordinator for the full analysis (defined outside the class)
def analyze_play_store(apps_df: pd.DataFrame, reviews_df: Optional[pd.DataFrame] = None) -> Dict[str, Any]:
    logger.info("=== GOOGLE PLAY STORE ANALYSIS ===")

    # Initializing analyzer
    analyzer = PlayStoreAnalyzer(apps_df, reviews_df)
    figures = []
    results = {}

    try:
        # 1. Correlation analysis
        logger.info("\n1. Analysis of correlations between metrics")
        pearson_corr, spearman_corr, corr_figures = analyzer.analyze_correlations()
        results['correlations'] = {'pearson': pearson_corr, 'spearman': spearman_corr}
        figures.extend(corr_figures)

        # Log correlations
        for method, corr_matrix in [('Pearson', pearson_corr), ('Spearman', spearman_corr)]:
            logger.info(f"\nSignificant {method} correlations (|corr| > 0.3):")
            for i in range(len(corr_matrix.columns)):
                for j in range(i+1, len(corr_matrix.columns)):
                    corr = corr_matrix.iloc[i, j]
                    if abs(corr) > 0.3:
                        logger.info(
                            f"{corr_matrix.index[i]} vs {corr_matrix.columns[j]}: {corr:.3f}"
                        )

        # 2. Temporal analysis
        logger.info("\n2. Analysis of temporal trends")
        time_metrics, time_fig = analyzer.analyze_temporal_trends()
        results['temporal'] = time_metrics
        figures.append(time_fig)

        # Log main trends
        first_metrics = time_metrics.iloc[0]
        last_metrics = time_metrics.iloc[-1]

        rating_change = ((last_metrics['Rating'] - first_metrics['Rating']) /
                        first_metrics['Rating'] * 100)

        logger.info("\nMain trends:")
        logger.info(
            f"Average rating: {rating_change:+.1f}% change "
            f"(from {first_metrics['Rating']:.2f} to {last_metrics['Rating']:.2f})"
        )
        logger.info(
            f"Average size: from {first_metrics['Size_MB']:.1f}MB to "
            f"{last_metrics['Size_MB']:.1f}MB"
        )

        def format_installs(val):
            return f"{val/1e6:.1f}M" if val >= 1e6 else f"{val/1e3:.1f}K"

        logger.info(
            f"Average installations: from {format_installs(first_metrics['Installs_Clean'])} to "
            f"{format_installs(last_metrics['Installs_Clean'])}"
        )

        logger.info(
            f"Number of apps: from {int(first_metrics['App']):,} to "
            f"{int(last_metrics['App']):,}"
        )

        if pd.notna(first_metrics['Price_Clean']) and pd.notna(last_metrics['Price_Clean']):
            logger.info(
                f"Average price: from ${first_metrics['Price_Clean']:.2f} to "
                f"${last_metrics['Price_Clean']:.2f}"
            )
        else:
            logger.info("Average price: data not available")

        results['figures'] = figures
        return results

    except Exception as e:
        logger.error(f"Error during Play Store analysis: {str(e)}")
        raise


# Run the analysis
analysis_results = analyze_play_store(apps_clean, reviews_clean)

# Show figures
for fig in analysis_results['figures']:
    fig.show()

6. Competitive and technical analysis¶

The code begins by defining a NamedTuple class called MarketStructureResults that serves as a container for the market structure analysis results. This class contains two elements:

  • market_df: a pandas dataframe with the calculated metrics for each category;

  • figure: a go.Figure object from Plotly for visualizing the results.

This structure clearly separates the data (the dataframe) from their visual representation (the figure), allowing both to be manipulated independently.

 

The MarketAnalyzer class is responsible for examining the market structure for each app category. In the constructor, we notice the removal of rows where the Category value is '1.9', as seen in previous cells.

The first static method, calculate_market_concentration, implements a version of the Herfindahl-Hirschman Index (HHI). This index measures the level of market concentration within a category. The HHI is calculated by summing the squares of each app's market shares (based on the number of installations) within the category. Higher values (close to 1) indicate a market dominated by few apps, while lower values indicate a more fragmented market with greater competition. The function includes error handling through try-except blocks and checks for edge cases (such as empty categories), returning np.nan when necessary. The np.clip function ensures that the result is between 0 and 1.
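The HHI computation described above can be sketched with a toy example (the installation counts below are made up for illustration, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical category with four apps and their install counts
category_data = pd.DataFrame({'Installs_Clean': [1_000_000, 500_000, 300_000, 200_000]})

total = category_data['Installs_Clean'].sum()
shares = category_data['Installs_Clean'] / total  # market shares sum to 1
hhi = float((shares ** 2).sum())                  # sum of squared shares
print(round(np.clip(hhi, 0, 1), 3))               # → 0.345
```

With shares of 0.5, 0.25, 0.15, and 0.1, the index lands at 0.345: a moderately concentrated market where one app holds half the installations.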

The calculate_category_stability method evaluates how stable or volatile a category is. This index is constructed by considering three dimensions of variability:

  1. rating variability: represents the standard deviation of ratings normalized with respect to the maximum possible value (5);

  2. update variability: measures how diverse the app update times are in the category;

  3. installation variability: calculates the relative standard deviation (coefficient of variation) of the number of installations after logarithmic transformation.

The final stability index is an inverse weighted average of these variabilities, where greater weights are assigned to rating variability (40%), followed by update and installation variability (30% each). A higher value (close to 1) indicates a more stable category, while lower values indicate greater volatility.
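A minimal sketch of this inverse weighted average, using hypothetical variability values for a single category:

```python
import numpy as np

# Hypothetical normalized variabilities for one category (illustrative values)
rating_var = 0.15    # std of ratings / 5
update_var = 0.40    # std of days since update / 365
install_var = 0.25   # coefficient of variation of log1p(installs)

# Inverse weighted average: more variability -> lower stability
stability = 1 - (rating_var * 0.4 + update_var * 0.3 + install_var * 0.3)
print(round(float(np.clip(stability, 0, 1)), 3))
```

Here the weighted variability sums to 0.255, so the stability index is 0.745: a fairly stable category despite slow update cadences.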

The calculate_direct_payment_propensity method evaluates the propensity for direct payment and is therefore important for evaluating this type of monetization. This index estimates how willing users in a category are to pay for apps through three components:

  1. paid app ratio: percentage of paid apps in the category (40% of the weight);

  2. price level: average price of paid apps normalized with respect to the maximum price (30% of the weight);

  3. rating difference between paid and free apps: transformed into a value between 0 and 1, where higher values indicate that paid apps are better rated than free ones (30% of the weight).

A higher value of the index (close to 1) suggests a category where users are more likely to pay for apps, potentially making direct monetization strategies more feasible.
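The three-component formula can be sketched with hypothetical values (the ratings 4.4 and 4.1 below are illustrative, not from the dataset):

```python
import numpy as np

# Hypothetical component values for one category (illustrative only)
paid_ratio = 0.10                      # 10% of apps in the category are paid
price_level = 0.20                     # mean paid price / max price in category
paid_vs_free = (4.4 - 4.1 + 5) / 10    # rating gap mapped into [0, 1]

propensity = paid_ratio * 0.4 + price_level * 0.3 + paid_vs_free * 0.3
print(round(float(np.clip(propensity, 0, 1)), 3))
```

The rating-gap transformation maps a difference of +0.3 stars to 0.53, slightly above the neutral midpoint of 0.5.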

The analyze_market_structure method coordinates the entire analysis process, iterating over each category, calculating the metrics discussed above and other basic statistics, and building a dataframe with the results.

It then creates a scatter visualization using Plotly, where:

  • the X-axis represents the market concentration index;

  • the Y-axis represents the stability index;

  • the size of the points represents the propensity for direct payment;

  • the color represents the average rating.

This multidimensional representation makes it possible to visually identify categories with desirable characteristics, such as high stability, low concentration (less competition), and a high propensity for direct payment.

 

The TechnicalAnalyzer class focuses on the technical aspects of apps in each category.

The main method, analyze_development_patterns, performs a comprehensive analysis of the technical aspects for each category. It analyzes:

  • the distribution of Android versions supported by apps in each category;

  • statistics on app size (mean, median, standard deviation);

  • update frequency (updates per year).

The code creates two visualizations:

  1. _create_android_distribution_plot: a stacked horizontal bar chart showing the percentage distribution of Android versions by category;

  2. _create_technical_details_plot: a box plot illustrating the distribution of app sizes by category, highlighting outliers.

Specifically, the _create_android_distribution_plot method orders the categories by computing a weighted average Android version: each version number is multiplied by its distribution percentage, so categories can be ranked from those skewed toward recent versions to those skewed toward older ones.
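This ordering logic can be sketched with a toy version matrix (category names and percentages below are hypothetical):

```python
import pandas as pd

# Hypothetical percentage distribution of Android versions per category
# (rows sum to 100; column labels are version numbers as strings)
android_matrix = pd.DataFrame(
    {'2.0': [50.0, 10.0], '4.0': [30.0, 20.0], '8.0': [20.0, 70.0]},
    index=['CAT_OLD', 'CAT_NEW'],
)

# Weighted average version: each version number weighted by its share
avg_version = sum(android_matrix[col] * float(col) for col in android_matrix.columns) / 100
order = avg_version.sort_values(ascending=False).index
print(list(order))  # categories from most recent to oldest supported versions
```

CAT_NEW averages version 6.6 against CAT_OLD's 3.8, so it is plotted first.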

The _create_technical_details_plot method visualizes the distribution of app sizes by category, showing median, quartiles, and outliers. The use of customdata to insert app names in the outlier points is an interesting design touch that allows users to directly identify which apps are significantly larger than the average in their category.

The perform_category_analysis function manages the entire competitive analysis process, handling both analyses (market and technical) and then calculating an overall score for each category. The final score is a weighted combination of:

  1. market score (80% of the final score), which includes:

  • category stability (40% of the market score);

  • inverse of market concentration (30% of the market score) - the lower the concentration, the higher this component;

  • payment propensity (30% of the market score).

  2. technical score (20% of the final score), based primarily on normalized update frequency.

The calculation of these scores effectively represents a scoring model that favors categories with stable markets, competition not dominated by few players, and with users willing to pay for applications. The technical aspect has a lower weight but favors categories with more frequent updates, indicative of a more active and dynamic ecosystem.
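The weighted scoring can be sketched with hypothetical category metrics (the values below are illustrative, not the actual results):

```python
import pandas as pd

# Hypothetical per-category metrics (illustrative, not from the dataset)
metrics = pd.DataFrame({
    'stability':          [0.77, 0.55],
    'concentration':      [0.05, 0.31],
    'payment_propensity': [0.30, 0.25],
    'updates_per_year':   [6.0, 4.0],
}, index=['CAT_A', 'CAT_B'])

# Market score: stability (40%), inverse concentration (30%), propensity (30%)
market_score = (metrics['stability'] * 0.4
                + (1 - metrics['concentration']) * 0.3
                + metrics['payment_propensity'] * 0.3)

# Technical score: update frequency normalized by the category maximum
technical_score = metrics['updates_per_year'] / metrics['updates_per_year'].max()

# Final score: 80% market, 20% technical
final_score = market_score * 0.8 + technical_score * 0.2
print(final_score.round(3))
```

Note how CAT_A benefits twice: a stronger market score and the maximum technical score, since normalization pins the fastest-updating category at 1.0.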

Finally, the function identifies the top 5 most promising categories based on the final score and generates a brief report.

 

The final part of the code block executes the perform_category_analysis function, passing the apps_clean and reviews_clean dataframes, which were preprocessed in previous blocks, as parameters. The function returns two objects: a results dictionary containing the numerical results of the analysis and a figures list with the generated visualizations.

This implementation iterates through the list of figures with a for loop and displays them sequentially using Plotly's show() method. Most of the numerical results contained in the results dictionary are recorded in the logs through logger.info() calls, while the top five categories and a methodology note are printed explicitly at the end of the function.

 

Interpretation of results¶

 

Market structure by category¶

This scatter plot maps app categories based on two crucial dimensions: the market concentration index (X-axis) and the stability index (Y-axis). The size of the circles represents the propensity for direct payment, while the color indicates the average rating, with a gradient from red for lower ratings to green for higher ones.

In the upper left part, we find ENTERTAINMENT, characterized by high stability (0.77) and low concentration (about 0.05). This position describes a balanced market, where no app excessively dominates and conditions remain relatively constant over time. The rather large size of the circle also suggests a good propensity of users to pay for entertainment content.

ART_AND_DESIGN stands out for its position in the upper right part of the graph, combining high stability (about 0.78) with moderate concentration (about 0.25). The intense green color indicates a very high average rating. This configuration represents a stable market where, despite the presence of some dominant players, quality seems to be rewarded and there might be opportunities for new entrants.

In the central part of the graph, we find categories such as MEDICAL and PERSONALIZATION, represented by large circles indicating a high propensity of users to pay for these types of apps. However, their relatively low position on the stability axis (about 0.45-0.50) suggests more volatile markets where conditions can change rapidly.

WEATHER, HEALTH_AND_FITNESS, and NEWS_AND_MAGAZINES show relatively high concentration, but variable circle sizes, representing market niches where few dominant players control much of the installations. These categories might require more marked differentiation strategies for new entrants.

In the lower part of the graph are categories such as FAMILY, GAME, and TOOLS, characterized by low stability and low concentration. These are highly fragmented and volatile markets with numerous competitors, where success might be more difficult to maintain in the long term.

 

Android version distribution by category¶

In this graph, a technical analysis is presented through stacked horizontal bars showing the percentage distribution of Android versions supported by apps in each category. More recent versions are represented in green, while older ones in red, creating a chromatic gradient that facilitates the identification of technological trends.

The adoption of more recent Android versions varies significantly among categories. ENTERTAINMENT, FOOD_AND_DRINK, and TRAVEL_AND_LOCAL stand out for the higher percentage of apps that support the latest versions. This might reflect the need to leverage the advanced features of new versions of the operating system.

At the opposite extreme, categories such as LIBRARIES_AND_DEMO, BOOKS_AND_REFERENCE, and COMMUNICATION show significant percentages of apps that still support older versions of Android (2.0-3.0). This more conservative approach might derive from the need to maintain compatibility with a wider range of devices, or from the presence of historical apps that are not frequently updated.

The GAME category shows a relatively uniform distribution across different Android versions, probably reflecting the need to reach the widest possible audience, from older to newer devices, given the importance of user volume in this highly competitive segment.

This analysis provides some indications for developers on how important multi-version support is in different categories. For some, rapid adoption of the latest technologies might represent a competitive advantage, while for others a more inclusive approach might maximize the potential user base.

 

App size distribution by category¶

This graph uses box plots to visualize the distribution of app sizes (in MB) for each category. Each box plot shows the median (central line), the interquartile range (the "box" representing the central 50% of the data), and outliers (individual points that significantly deviate from the main distribution).

GAME, FAMILY, and MEDICAL emerge as the categories with the highest median sizes, ranging between 20 and 40 MB. The wider boxes in these categories also indicate greater variability in sizes.

Categories such as LIBRARIES_AND_DEMO, PRODUCTIVITY, and COMMUNICATION tend to have more contained median sizes, generally between 5 and 15 MB. This lightness suggests a greater focus on efficiency and functionality rather than elaborate multimedia content.

A recurring aspect in almost all categories is the presence of significant outliers, visualized as points above the boxes. These represent apps that deviate considerably from the central tendency of their category, reaching in some cases sizes of 80-100 MB. Particularly noteworthy are the outliers in the GAME, FAMILY, and MEDICAL categories.

The variability within categories provides further interpretive insights. GAME shows the widest interquartile range, indicating the greatest heterogeneity in sizes. This characteristic probably reflects the diversity of available game genres, from simple puzzles that require few resources to complex 3D games with elaborate graphic assets. In contrast, categories such as WEATHER, PRODUCTIVITY, and LIBRARIES_AND_DEMO present more compact boxes, suggesting greater homogeneity in sizes and, potentially, greater standardization in development practices.

 

In summary, from the three visualizations present in this part of the code, some relevant strategic considerations emerge for those intending to develop a new app.

The scoring model used identifies ENTERTAINMENT as the most promising category (0.785), followed by FOOD_AND_DRINK (0.764), ART_AND_DESIGN (0.740), DATING (0.735), and SHOPPING (0.733).

Category        final_score  market_score  technical_score
ENTERTAINMENT         0.785         0.733            0.993
FOOD_AND_DRINK        0.764         0.709            0.984
ART_AND_DESIGN        0.740         0.680            0.978
DATING                0.735         0.669            1.000
SHOPPING              0.733         0.672            0.976

These results derive from a formula that assigns 80% of the weight to the market score (combination of stability, low concentration, and payment propensity) and 20% to the technical score (based primarily on update frequency). It is important to note that this methodology can generate rankings that seem to contrast with what can be observed in the graphs. For example, ART_AND_DESIGN appears particularly favorable in the market structure graph (high stability and excellent rating), but is only third in the final ranking.

This weighting explains why categories like DATING, with a perfect technical score (1.000), rank high in the final ranking despite less favorable market characteristics. This discrepancy highlights the importance of considering both the numerical results and the graphical visualizations for a complete understanding of market opportunities.

The visualizations thus offer an empirical basis for navigating the Google Play Store ecosystem, but it is advisable to critically evaluate the weights assigned to different factors based on one's own strategic priorities.

In [ ]:
logger = logging.getLogger(__name__)

class MarketStructureResults(NamedTuple):
    market_df: pd.DataFrame
    figure: go.Figure

class MarketAnalyzer:

    def __init__(self, apps_df: pd.DataFrame, max_workers: int = 4):
        self.apps_df = apps_df[apps_df['Category'] != '1.9'].copy()
        self.max_workers = max_workers

    @staticmethod
    def calculate_market_concentration(category_data: pd.DataFrame) -> float:
        try:
            if len(category_data) == 0:
                return np.nan

            total_installs = category_data['Installs_Clean'].sum()
            if total_installs == 0:
                return np.nan

            market_shares = category_data['Installs_Clean'] / total_installs
            hhi = (market_shares ** 2).sum()

            return np.clip(hhi, 0, 1)
        except Exception:
            return np.nan

    @staticmethod
    def calculate_category_stability(category_data: pd.DataFrame) -> float:
        try:
            if len(category_data) < 2:
                return np.nan

            # Calculate rating variability
            rating_var = (category_data['Rating'].std() / 5
                         if not category_data['Rating'].isna().all() else 0)

            # Calculate update variability
            update_var = (category_data['Days_Since_Update'].std() / 365
                         if not category_data['Days_Since_Update'].isna().all() else 0)

            # Calculate installation variability
            installs = np.log1p(category_data['Installs_Clean'])
            install_var = (installs.std() / installs.mean()
                          if not installs.isna().all() and installs.mean() > 0 else 0)

            # Calculate overall stability
            stability = 1 - (rating_var * 0.4 + update_var * 0.3 + install_var * 0.3)
            return np.clip(stability, 0, 1)
        except Exception:
            return np.nan

    @staticmethod
    def calculate_direct_payment_propensity(category_data: pd.DataFrame) -> float:
        try:
            if len(category_data) == 0:
                return np.nan

            # Calculate paid app ratio
            paid_apps = category_data[category_data['Price_Clean'] > 0]
            paid_ratio = len(paid_apps) / len(category_data)

            # Calculate price level
            max_price = category_data['Price_Clean'].max()
            price_level = (paid_apps['Price_Clean'].mean() / max_price
                         if len(paid_apps) > 0 and max_price > 0 else 0)

            # Calculate rating difference paid vs free
            if len(paid_apps) > 0:
                free_rating = category_data[category_data['Price_Clean'] == 0]['Rating'].mean()
                paid_rating = paid_apps['Rating'].mean()
                paid_vs_free = (paid_rating - free_rating + 5) / 10
            else:
                paid_vs_free = 0

            # Calculate overall propensity
            propensity = (paid_ratio * 0.4 + price_level * 0.3 + paid_vs_free * 0.3)
            return np.clip(propensity, 0, 1)
        except Exception:
            return np.nan

    def analyze_market_structure(self) -> MarketStructureResults:

        # List to collect results
        results = []

        # Analysis by category
        for category in self.apps_df['Category'].unique():
            category_data = self.apps_df[self.apps_df['Category'] == category]

            results.append({
                'Category': category,
                'num_apps': len(category_data),
                'concentration': self.calculate_market_concentration(category_data),
                'stability': self.calculate_category_stability(category_data),
                'payment_propensity': self.calculate_direct_payment_propensity(category_data),
                'avg_rating': category_data['Rating'].mean(),
                'total_installs': category_data['Installs_Clean'].sum()
            })

        # Create results DataFrame
        market_df = pd.DataFrame(results).set_index('Category')

        # Create chart
        fig = go.Figure(data=[
            go.Scatter(
                x=market_df['concentration'],
                y=market_df['stability'],
                mode='markers+text',
                text=market_df.index,
                textposition='top right',
                textfont=dict(size=9),
                marker=dict(
                    size=market_df['payment_propensity'].fillna(0) * 50 + 20,
                    color=market_df['avg_rating'].fillna(market_df['avg_rating'].mean()),
                    colorscale='RdYlGn',
                    colorbar=dict(title='Average Rating'),
                    showscale=True
                ),
                # Pass the raw propensity as customdata: %{marker.size} would show
                # the scaled pixel size, not the underlying index value
                customdata=market_df['payment_propensity'],
                hovertemplate=(
                    "<b>%{text}</b><br>" +
                    "Concentration: %{x:.3f}<br>" +
                    "Stability: %{y:.3f}<br>" +
                    "Payment Propensity: %{customdata:.3f}<br>" +
                    "Rating: %{marker.color:.2f}<br>" +
                    "<extra></extra>"
                )
            )
        ])

        # Update layout
        fig.update_layout(
            title=dict(
                text='Market Structure by Category',
                x=0.5,
                y=0.95,
                xanchor='center',
                yanchor='top',
                font=dict(size=20)
            ),
            xaxis_title='Market Concentration Index',
            yaxis_title='Stability Index',
            height=800,
            showlegend=False,
            plot_bgcolor='white',
            margin=dict(t=150),
            annotations=[
                dict(
                    text='Size = Direct Payment Propensity',
                    xref='paper',
                    yref='paper',
                    x=0.5,
                    y=1.08,
                    xanchor='center',
                    yanchor='middle',
                    showarrow=False,
                    font=dict(size=11)
                ),
                dict(
                    text='Color = Average Rating',
                    xref='paper',
                    yref='paper',
                    x=0.5,
                    y=1.04,
                    xanchor='center',
                    yanchor='middle',
                    showarrow=False,
                    font=dict(size=11)
                )
            ]
        )

        return MarketStructureResults(market_df, fig)


class TechnicalAnalyzer:

    def __init__(self, apps_df: pd.DataFrame, max_workers: int = 4):
        self.apps_df = apps_df[apps_df['Category'] != '1.9'].copy()
        self.max_workers = max_workers

    def analyze_development_patterns(self) -> Tuple[pd.DataFrame, List[go.Figure]]:
        """Analyzes development patterns for each category"""
        tech_results = []
        android_distributions = {}

        # Calculate Android version distributions by category
        categories = self.apps_df['Category'].unique()
        for category in categories:
            category_data = self.apps_df[self.apps_df['Category'] == category]

            # Android version distribution
            version_dist = category_data['Android_Ver_Clean'].value_counts()
            total_apps = len(category_data)
            version_percentages = (version_dist / total_apps * 100).round(1)
            android_distributions[category] = version_percentages

            # Technical statistics
            size_stats = {
                'mean': category_data['Size_MB'].mean(),
                'median': category_data['Size_MB'].median(),
                'std': category_data['Size_MB'].std()
            }

            # Proxy for update frequency: inverse of the mean days since last update
            days_since = category_data['Days_Since_Update'].mean()
            updates_per_year = 365 / days_since if days_since > 0 else np.nan

            tech_results.append({
                'Category': category,
                'avg_size': size_stats['mean'],
                'size_variability': size_stats['std'] / size_stats['mean'] if size_stats['mean'] > 0 else 0,
                'updates_per_year': updates_per_year,
                'num_apps': total_apps
            })

        # Create Android version matrix
        all_versions = sorted(set().union(*[dist.index for dist in android_distributions.values()]))
        android_matrix = pd.DataFrame(
            index=android_distributions.keys(),
            columns=all_versions,
            data=0.0
        )

        # Populate matrix
        for category, dist in android_distributions.items():
            for version in dist.index:
                android_matrix.loc[category, version] = dist[version]

        tech_df = pd.DataFrame(tech_results).set_index('Category')

        # Create charts
        android_fig = self._create_android_distribution_plot(android_matrix)
        tech_fig = self._create_technical_details_plot(tech_df)

        return tech_df, [android_fig, tech_fig]

    def _create_android_distribution_plot(self, android_matrix: pd.DataFrame) -> go.Figure:

        # Sort categories by average version
        weighted_avg = pd.DataFrame({
            'avg_version': sum(android_matrix[col] * float(col) for col in android_matrix.columns) / 100
        })
        android_matrix_sorted = android_matrix.loc[weighted_avg.sort_values('avg_version', ascending=False).index]

        # Color scale for versions
        n_versions = len(android_matrix_sorted.columns)
        colors = [
            f'rgb({int(255*(1-i/n_versions))}, {int(255*(i/n_versions))}, 0)'
            for i in range(n_versions)
        ]

        # Create chart
        fig = go.Figure()

        for i, version in enumerate(android_matrix_sorted.columns):
            fig.add_trace(go.Bar(
                name=f'Android {version}',
                y=android_matrix_sorted.index,
                x=android_matrix_sorted[version],
                orientation='h',
                marker_color=colors[i],
                hovertemplate=(
                    "<b>%{y}</b><br>" +
                    f"Android {version}: " + "%{x:.1f}%<br>" +
                    "<extra></extra>"
                )
            ))

        fig.update_layout(
            title=dict(
                text='Android Version Distribution by Category',
                x=0.5,
                font=dict(size=20)
            ),
            xaxis_title='Percentage of apps (%)',
            yaxis_title='Category',
            barmode='stack',
            height=800,
            showlegend=True,
            plot_bgcolor='white',
            legend=dict(
                title='Android Version',
                yanchor="top",
                y=0.99,
                xanchor="left",
                x=1.02
            ),
            margin=dict(l=200, r=150),
            bargap=0.1
        )

        fig.add_vline(x=100, line_dash="dash", line_color="gray", opacity=0.5)

        return fig

    def _create_technical_details_plot(self, tech_df: pd.DataFrame) -> go.Figure:
        categories = tech_df.index.tolist()
        fig = go.Figure()

        for category in categories:
            category_data = self.apps_df[self.apps_df['Category'] == category]
            category_sizes = category_data['Size_MB'].dropna()
            app_names = category_data.loc[category_sizes.index, 'App'].values

            fig.add_trace(go.Box(
                y=category_sizes,
                name=category,
                boxpoints='outliers',
                jitter=0.3,
                pointpos=-1.8,
                hovertemplate=(
                    "<b>%{customdata}</b><br>" +
                    "Category: %{x}<br>" +
                    "Size: %{y:.1f}MB<br>" +
                    "<extra></extra>"
                ),
                customdata=app_names
            ))

        fig.update_layout(
            title=dict(
                text='App Size Distribution by Category',
                x=0.5,
                font=dict(size=20)
            ),
            xaxis=dict(
                title='Category',
                tickangle=45,
                tickfont=dict(size=10)
            ),
            yaxis_title='Size (MB)',
            showlegend=False,
            height=700,
            margin=dict(b=150, t=150),
            plot_bgcolor='white',
            annotations=[dict(
                text=('Box = interquartile range (25th-75th percentile)<br>' +
                      'Line = median<br>' +
                      'Points = outliers (significantly heavier apps)'),
                xref='paper',
                yref='paper',
                x=0.5,
                y=1.1,
                showarrow=False,
                font=dict(size=12)
            )]
        )

        return fig

def perform_category_analysis(apps_df: pd.DataFrame, reviews_df: pd.DataFrame) -> Tuple[Dict[str, Any], List[go.Figure]]:
    logger.info("=== IN-DEPTH CATEGORY ANALYSIS ===\n")
    results = {}
    figures = []

    # 1. Market structure analysis
    logger.info("\n1. Market structure analysis")
    market_analyzer = MarketAnalyzer(apps_df)
    market_results = market_analyzer.analyze_market_structure()
    results['market'] = market_results.market_df
    figures.append(market_results.figure)

    # Log market results
    logger.info("\nTop 5 categories by market concentration:")
    print(market_results.market_df[['concentration', 'stability', 'avg_rating']]
          .nlargest(5, 'concentration'))

    # 2. Technical analysis
    logger.info("\n2. Technical aspects analysis")
    tech_analyzer = TechnicalAnalyzer(apps_df)
    tech_df, tech_figs = tech_analyzer.analyze_development_patterns()
    results['technical'] = tech_df
    figures.extend(tech_figs)

    # Calculate final score
    categories = market_results.market_df.index
    final_scores = pd.DataFrame(index=categories)

    # Market score (80%)
    final_scores['market_score'] = (
        market_results.market_df['stability'] * 0.4 +
        (1 - market_results.market_df['concentration']) * 0.3 +
        market_results.market_df['payment_propensity'].fillna(0) * 0.3
    )

    # Technical score (20%)
    final_scores['technical_score'] = (
        tech_df['updates_per_year'].fillna(0) / tech_df['updates_per_year'].max()
    )

    # Final score
    final_scores['final_score'] = (
        final_scores['market_score'] * 0.8 +
        final_scores['technical_score'] * 0.2
    )

    results['final_scores'] = final_scores

    # Final report
    logger.info("\nTop 5 most promising categories:")
    top_5 = final_scores.nlargest(5, 'final_score')
    for cat in top_5.index:
        logger.info(f"\n{cat}:")
        logger.info(f"- Final score: {top_5.loc[cat, 'final_score']:.3f}")
        logger.info(f"- Market score: {top_5.loc[cat, 'market_score']:.3f}")
        logger.info(f"- Technical score: {top_5.loc[cat, 'technical_score']:.3f}")
        logger.info(f"- Market details:")
        logger.info(f"  * Concentration: {market_results.market_df.loc[cat, 'concentration']:.3f}")
        logger.info(f"  * Stability: {market_results.market_df.loc[cat, 'stability']:.3f}")
        logger.info(f"  * Payment propensity: {market_results.market_df.loc[cat, 'payment_propensity']:.3f}")

    # Explicit display of top 5 categories
    print("\n======== TOP 5 MOST PROMISING CATEGORIES ========")
    print(top_5[['final_score', 'market_score', 'technical_score']].round(3))

    print("\nMETHODOLOGY NOTE:")
    print("The ranking is based on a formula that assigns:")
    print("- 80% to market score: stability (40%), low concentration (30%), payment propensity (30%)")
    print("- 20% to technical score: mainly normalized update frequency")
    print("\nThis methodology might generate results that seem to contrast with some")
    print("graphical visualizations, where categories like ART_AND_DESIGN appear more favorable.")
    print("The significant weight given to update frequency explains why categories")
    print("like DATING, with perfect technical score (1.000), rank high in the final ranking.")

    return results, figures

# Run the analysis
results, figures = perform_category_analysis(apps_clean, reviews_clean)

# Display charts
for fig in figures:
    fig.show()
                    concentration  stability  avg_rating
Category                                                
HEALTH_AND_FITNESS          0.313      0.554       4.226
ART_AND_DESIGN              0.265      0.789       4.357
NEWS_AND_MAGAZINES          0.222      0.556       4.143
PARENTING                   0.186      0.749       4.300
WEATHER                     0.180      0.493       4.231

======== TOP 5 MOST PROMISING CATEGORIES ========
                final_score  market_score  technical_score
Category                                                  
ENTERTAINMENT         0.785         0.733            0.993
FOOD_AND_DRINK        0.764         0.709            0.984
ART_AND_DESIGN        0.740         0.680            0.978
DATING                0.735         0.669            1.000
SHOPPING              0.733         0.672            0.976

METHODOLOGY NOTE:
The ranking is based on a formula that assigns:
- 80% to market score: stability (40%), low concentration (30%), payment propensity (30%)
- 20% to technical score: mainly normalized update frequency

This methodology might generate results that seem to contrast with some
graphical visualizations, where categories like ART_AND_DESIGN appear more favorable.
The significant weight given to update frequency explains why categories
like DATING, with perfect technical score (1.000), rank high in the final ranking.

Conclusion¶

The analysis of the app market on the Google Play Store has revealed a complex and continuously evolving ecosystem, while also offering valuable insights for stakeholders interested in investing in this sector.

The landscape is characterized by a marked heterogeneity in the distribution of applications, with categories such as FAMILY and GAME dominating in numerical terms, together representing about a third of the overall market. However, this numerical dominance does not necessarily translate into limited opportunities for new entrants. On the contrary, the analysis has highlighted how less crowded categories tend to register higher average ratings, suggesting that in niche markets it is possible to emerge with quality products capable of meeting user expectations.

Monetization strategies show noteworthy patterns: the largest share of paid apps (36.4%) is positioned in the $1-2.99 price range. It is interesting to note how applications with lower prices tend to receive higher ratings, while premium ones often struggle to meet the high expectations generated by their cost.

The competitive analysis has led to the identification of some potentially promising opportunities. The scoring model, which integrates market and technical metrics, has identified ENTERTAINMENT, FOOD_AND_DRINK, and ART_AND_DESIGN as the most favorable categories. These categories combine good market stability with relatively low concentration and, in the case of ART_AND_DESIGN, very positive ratings.

The correlations between different metrics reveal some strategic insights: update frequency emerges as a crucial factor related to app success, with a clear relationship between recent updates, better ratings, and a higher number of installations. The size of the application also shows a positive correlation with installations, suggesting that users appreciate apps with more features despite the greater space required.
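A correlation check like the one described could look as follows. The rows are fabricated for illustration, and Spearman rank correlation is one reasonable choice given the heavily skewed install counts:

```python
import pandas as pd

# Hypothetical app-level features: update recency, size, rating, installs
apps = pd.DataFrame(
    {
        "days_since_update": [10, 400, 30, 700, 5, 200],
        "size_mb": [35, 12, 28, 8, 40, 20],
        "rating": [4.5, 3.9, 4.3, 3.7, 4.6, 4.1],
        "installs": [5_000_000, 100_000, 1_000_000, 50_000, 10_000_000, 500_000],
    }
)

# Spearman rank correlation is robust to the skewed installs distribution
corr = apps.corr(method="spearman")
```

In a matrix like this, a negative `days_since_update` vs. `rating` entry and a positive `size_mb` vs. `installs` entry would match the relationships described above.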

A more granular analysis reveals that categories differ significantly in their technical characteristics. Entertainment, food, and travel apps tend to adopt more recent versions of Android more quickly, while categories such as books and communication maintain compatibility with older versions.

These findings suggest several strategic considerations. Less crowded niches offer room to stand out with quality products, while more competitive sectors call for sharper differentiation strategies. Frequent updates and constant maintenance appear essential to keep user satisfaction high and to grow installations. Finally, the propensity to pay varies significantly between categories, suggesting different monetization strategies depending on the chosen sector.

 

Strategic recommendations¶

MEDICAL emerges as a particularly favorable category across all the dimensions analyzed. With a good average rating of 4.2, it sits in the high range of user satisfaction while maintaining moderate competition (439 apps). Its relatively large average size (20-40 MB) points to feature-rich applications, while the very high payment propensity (22.6% of apps are paid) suggests users willing to invest in quality solutions. Although its market stability is only moderate, that dynamism can be an opportunity for new entrants with sufficiently innovative ideas. Correlation analysis also suggests that in this category regular updates and an adequate app size are particularly rewarding in terms of installations.

ART_AND_DESIGN offers an optimal balance between market opportunities and long-term sustainability. The category stands out for its excellent average rating (4.3) and high market stability (0.78), indicating satisfied users and relatively constant competitive conditions over time. The moderate concentration (0.25) suggests that, despite the presence of some dominant players, there is room for new entrants with quality products. The temporal evolution also shows a constant growth trend in both installations and average app sizes, signaling an expanding market. The distribution of Android versions indicates a good adoption of recent versions, allowing the implementation of new features without sacrificing the user base.

ENTERTAINMENT completes the trio of recommended opportunities, standing out for its unique combination of high stability (0.77) and very low concentration (0.05). This configuration describes a balanced market in which no single app dominates, creating a favorable environment for new entrants. The category has shown steady growth in average installations in recent years, together with a good propensity among users to pay for quality content. The distribution of Android versions also reveals one of the highest adoption rates of recent releases, indicating a technologically up-to-date audience that is potentially more receptive to cutting-edge solutions. The observed correlations confirm that update frequency is particularly rewarding in this category, with a strong relationship between recent updates and high ratings.

 

Streamlit dashboard¶

To facilitate the exploration of this data, I have created an interactive dashboard in Streamlit. This tool allows you to dynamically visualize the analyzed metrics, filter data by category, and generate personalized insights based on configurable parameters.
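The category-filtering logic behind such a dashboard can be reduced to a small pandas helper. The function and column names below are a hypothetical sketch, not the actual dashboard code:

```python
import pandas as pd

def category_summary(apps: pd.DataFrame, category: str, min_rating: float = 0.0) -> dict:
    """Filter apps by category and minimum rating, then summarise key metrics.

    Hypothetical helper illustrating the kind of filtering a Streamlit
    dashboard could apply when the user picks a category and a rating slider.
    """
    subset = apps[(apps["Category"] == category) & (apps["Rating"] >= min_rating)]
    return {
        "n_apps": len(subset),
        "avg_rating": subset["Rating"].mean(),
        "total_installs": int(subset["Installs"].sum()),
    }

# Toy data standing in for the cleaned Play Store dataset
apps = pd.DataFrame(
    {
        "Category": ["ENTERTAINMENT", "ENTERTAINMENT", "DATING"],
        "Rating": [4.5, 3.2, 4.0],
        "Installs": [1_000_000, 50_000, 200_000],
    }
)

summary = category_summary(apps, "ENTERTAINMENT", min_rating=4.0)
```

In the dashboard, `category` and `min_rating` would come from Streamlit widgets (e.g. a selectbox and a slider) rather than being hard-coded.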